Researchers warn we could run out of data to train AI by 2026. What then?


Article by Rita Matulionyte: “As artificial intelligence (AI) reaches the peak of its popularity, researchers have warned the industry might be running out of training data – the fuel that runs powerful AI systems. This could slow down the growth of AI models, especially large language models, and may even alter the trajectory of the AI revolution.

But why is a potential lack of data an issue, considering how much there is on the web? And is there a way to address the risk?…

We need a lot of data to train powerful, accurate and high-quality AI algorithms. For instance, ChatGPT was trained on 570 gigabytes of text data, or about 300 billion words.

Similarly, the Stable Diffusion algorithm (which is behind many AI image-generating apps such as DALL-E, Lensa and Midjourney) was trained on the LAION-5B dataset comprising 5.8 billion image-text pairs. If an algorithm is trained on an insufficient amount of data, it will produce inaccurate or low-quality outputs.

The quality of the training data is also important…This is why AI developers seek out high-quality content such as text from books, online articles, scientific papers, Wikipedia, and certain filtered web content. The Google Assistant was trained on 11,000 romance novels taken from self-publishing site Smashwords to make it more conversational.

The AI industry has been training AI systems on ever-larger datasets, which is why we now have high-performing models such as ChatGPT or DALL-E 3. At the same time, research shows online data stocks are growing much slower than datasets used to train AI.

In a paper published last year, a group of researchers predicted we will run out of high-quality text data before 2026 if the current AI training trends continue. They also estimated low-quality language data will be exhausted sometime between 2030 and 2050, and low-quality image data between 2030 and 2060.

AI could contribute up to US$15.7 trillion (A$24.1 trillion) to the world economy by 2030, according to accounting and consulting group PwC. But running out of usable data could slow down its development…(More)”.

Democratic Policy Development using Collective Dialogues and AI


Paper by Andrew Konya, Lisa Schirch, Colin Irwin, Aviv Ovadya: “We design and test an efficient democratic process for developing policies that reflect informed public will. The process combines AI-enabled collective dialogues that make deliberation democratically viable at scale with bridging-based ranking for automated consensus discovery. A GPT4-powered pipeline translates points of consensus into representative policy clauses from which an initial policy is assembled. The initial policy is iteratively refined with the input of experts and the public before a final vote and evaluation. We test the process three times with the US public, developing policy guidelines for AI assistants related to medical advice, vaccine information, and wars & conflicts. We show the process can be run in two weeks with 1500+ participants for around $10,000, and that it generates policy guidelines with strong public support across demographic divides. We measure 75-81% support for the policy guidelines overall, and no less than 70-75% support across demographic splits spanning age, gender, religion, race, education, and political party. Overall, this work demonstrates an end-to-end proof of concept for a process we believe can help AI labs develop common-ground policies, governing bodies break political gridlock, and diplomats accelerate peace deals…(More)”.

Matchmaking Research To Policy: Introducing Britain’s Areas Of Research Interest Database


Article by Kathryn Oliver: “Areas of research interest (ARIs) were originally recommended in the 2015 Nurse Review, which argued that if government stated what it needed to know more clearly and more regularly, then it would be easier for policy-relevant research to be produced.

During our time in government, Annette Boaz and I worked to develop these areas of research interest, mobilize experts and produce evidence syntheses and other outputs addressing them, largely in response to the COVID pandemic. As readers of this blog will know, we have learned a lot about what it takes to mobilize evidence – the hard and often hidden labor of creating and sustaining relationships, being part of transient teams, managing group dynamics, and honing listening and diplomatic skills.

Some of the challenges we encountered, including the oft-cited cultural gap between research and policy, the relevance of evidence, and the difficulty in resourcing knowledge mobilization and evidence synthesis, require systemic responses. However, one challenge, the information gap noted by Nurse between researchers and what government departments actually want to know, offered a simpler solution.

Up until September 2023, departmental ARIs were published on gov.uk, in PDF or HTML format. Although a good start, we felt that having all the ARIs in one searchable database would make them more interactive and accessible. So, working with Overton, we developed the new ARI database. The primary benefit of the database will be to raise awareness of ARIs (through email alerts about new ARIs) and accessibility (by holding all ARIs in one place which is easily searchable)…(More)”.

What Is Public Trust in the Health System? Insights into Health Data Use


Open Access Book by Felix Gille: “This book explores the concept of public trust in health systems.

In the context of recent events, including public response to interventions to tackle the COVID-19 pandemic, vaccination uptake and the use of health data and digital health, this important book uses empirical evidence to address why public trust is vital to a well-functioning health system.

In doing so, it provides a comprehensive contemporary explanation of public trust, how it affects health systems and how it can be nurtured and maintained as an integral component of health system governance…(More)”.

Chatbots May ‘Hallucinate’ More Often Than Many Realize


Cade Metz at The New York Times: “When the San Francisco start-up OpenAI unveiled its ChatGPT online chatbot late last year, millions were wowed by the humanlike way it answered questions, wrote poetry and discussed almost any topic. But most people were slow to realize that this new kind of chatbot often makes things up.

When Google introduced a similar chatbot several weeks later, it spewed nonsense about the James Webb telescope. The next day, Microsoft’s new Bing chatbot offered up all sorts of bogus information about the Gap, Mexican nightlife and the singer Billie Eilish. Then, in March, ChatGPT cited a half dozen fake court cases while writing a 10-page legal brief that a lawyer submitted to a federal judge in Manhattan.

Now a new start-up called Vectara, founded by former Google employees, is trying to figure out how often chatbots veer from the truth. The company’s research estimates that even in situations designed to prevent it from happening, chatbots invent information at least 3 percent of the time — and as high as 27 percent.

Experts call this chatbot behavior “hallucination.” It may not be a problem for people tinkering with chatbots on their personal computers, but it is a serious issue for anyone using this technology with court documents, medical information or sensitive business data.

Because these chatbots can respond to almost any request in an unlimited number of ways, there is no way of definitively determining how often they hallucinate. “You would have to look at all of the world’s information,” said Simon Hughes, the Vectara researcher who led the project…(More)”.

Climate data can save lives. Most countries can’t access it.


Article by Zoya Teirstein: “Earth just experienced one of its hottest, and most damaging, periods on record. Heat waves in the United States, Europe, and China; catastrophic flooding in India, Brazil, Hong Kong, and Libya; and outbreaks of malaria, dengue, and other mosquito-borne illnesses across southern Asia claimed tens of thousands of lives. The vast majority of these deaths could have been averted with the right safeguards in place.

The World Meteorological Organization, or WMO, published a report last week that shows just 11 percent of countries have the full arsenal of tools required to save lives as the impacts of climate change — including deadly weather events, infectious diseases, and respiratory illnesses like asthma — become more extreme. The United Nations climate agency predicts that significant natural disasters will hit the planet 560 times per year by the end of this decade. What’s more, countries that lack early warning systems, such as extreme heat alerts, will see eight times more climate-related deaths than countries that are better prepared. By midcentury, some 50 percent of these deaths will take place in Africa, a continent that is responsible for around 4 percent of the world’s greenhouse gas emissions each year…(More)”.

Smart City Data Governance


OECD Report: “Smart cities leverage technologies, in particular digital, to generate a vast amount of real-time data to inform policy- and decision-making for an efficient and effective public service delivery. Their success largely depends on the availability and effective use of data. However, the amount of data generated is growing more rapidly than governments’ capacity to store and process them, and the growing number of stakeholders involved in data production, analysis and storage pushes cities’ data management capacity to the limit. Despite the wide range of local and national initiatives to enhance smart city data governance, urban data is still a challenge for national and city governments due to: insufficient financial resources; lack of business models for financing and refinancing of data collection; limited access to skilled experts; the lack of full compliance with the national legislation on data sharing and protection; and data and security risks. Facing these challenges is essential to managing and sharing data sensibly if cities are to boost citizens’ well-being and promote sustainable environments…(More)”.

Assessing and Suing an Algorithm


Report by Elina Treyger, Jirka Taylor, Daniel Kim, and Maynard A. Holliday: “Artificial intelligence algorithms are permeating nearly every domain of human activity, including processes that make decisions about interests central to individual welfare and well-being. How do public perceptions of algorithmic decisionmaking in these domains compare with perceptions of traditional human decisionmaking? What kinds of judgments about the shortcomings of algorithmic decisionmaking processes underlie these perceptions? Will individuals be willing to hold algorithms accountable through legal channels for unfair, incorrect, or otherwise problematic decisions?

Answers to these questions matter at several levels. In a democratic society, a degree of public acceptance is needed for algorithms to become successfully integrated into decisionmaking processes. And public perceptions will shape how the harms and wrongs caused by algorithmic decisionmaking are handled. This report shares the results of a survey experiment designed to contribute to researchers’ understanding of how U.S. public perceptions are evolving in these respects in one high-stakes setting: decisions related to employment and unemployment…(More)”.

Can Large Language Models Capture Public Opinion about Global Warming? An Empirical Assessment of Algorithmic Fidelity and Bias


Paper by S. Lee et al.: “Large language models (LLMs) have demonstrated their potential in social science research by emulating human perceptions and behaviors, a concept referred to as algorithmic fidelity. This study assesses the algorithmic fidelity and bias of LLMs by utilizing two nationally representative climate change surveys. The LLMs were conditioned on demographics and/or psychological covariates to simulate survey responses. The findings indicate that LLMs can effectively capture presidential voting behaviors but encounter challenges in accurately representing global warming perspectives when relevant covariates are not included. GPT-4 exhibits improved performance when conditioned on both demographics and covariates. However, disparities emerge in LLM estimations of the views of certain groups, with LLMs tending to underestimate worry about global warming among Black Americans. While highlighting the potential of LLMs to aid social science research, these results underscore the importance of meticulous conditioning, model selection, survey question format, and bias assessment when employing LLMs for survey simulation. Further investigation into prompt engineering and algorithm auditing is essential to harness the power of LLMs while addressing their inherent limitations…(More)”.

Unintended Consequences of Data-driven public participation: How Low-Traffic Neighborhood planning became polarized


Paper by Alison Powell: “This paper examines how data-driven consultation contributes to dynamics of political polarization, using the case of ‘Low-Traffic Neighborhoods’ in London, UK. It explores how data-driven consultation can facilitate participation, including ‘agonistic data practices’ (Crooks and Currie, 2022) that challenge the dominant interpretations of digital data. The paper adds empirical detail to previous studies of agonistic data practices, concluding that agonistic data practices require certain normative conditions to be met, otherwise dissenting data practices can contribute to dynamics of polarization. The results of this paper draw on empirical insights from the political context of the UK to explain how ostensibly democratic processes including data-driven consultation establish some kinds of knowledge as more legitimate than others. Apparently ‘objective’ knowledge, or calculable data, is attributed greater legitimacy than strong feelings or affective narratives. This can displace affective responses to policy decisions into insular social media spaces where polarizing dynamics are at play. Affective polarization, where political difference is solidified through appeals to feeling, creates political distance and the dehumanization of ‘others’. This can help to amplify conspiracy theories that pose risks to democracy and to the overall legitimacy of media environments. These tendencies are exacerbated when processes of consultation prescribe narrow or specific contributions, valorize quantifiable or objective data and create limited room for dissent…(More)”.