Frontier AI: double-edged sword for public sector


Article by Zeynep Engin: “The power of the latest AI technologies, often referred to as ‘frontier AI’, lies in their ability to automate decision-making by harnessing complex statistical insights from vast amounts of unstructured data, using models that surpass human understanding. The introduction of ChatGPT in late 2022 marked a new era for these technologies, making advanced AI models accessible to a wide range of users, a development poised to permanently reshape how our societies function.

From a public policy perspective, this capacity offers the optimistic potential to enable personalised services at scale, potentially revolutionising healthcare, education, local services, democratic processes, and justice, tailoring them to everyone’s unique needs in a digitally connected society. The ambition is to achieve better outcomes than humanity has managed so far without AI assistance. There is certainly a vast opportunity for improvement, given the current state of global inequity, environmental degradation, polarised societies, and other chronic challenges facing humanity.

However, it is crucial to temper this optimism by recognising the significant risks. On their current trajectories, these technologies are already starting to undermine hard-won democratic gains and civil rights. Integrating AI into public policy and decision-making processes risks exacerbating existing inequalities and unfairness, potentially leading to new, uncontrollable forms of discrimination at unprecedented speed and scale. The environmental impacts, both direct and indirect, could be catastrophic, while the rise of AI-powered personalised misinformation and behavioural manipulation is contributing to increasingly polarised societies.

Steering the direction of AI to be in the public interest requires a deeper understanding of its characteristics and behaviour. To imagine and design new approaches to public policy and decision-making, we first need a comprehensive understanding of what this remarkable technology offers and its potential implications…(More)”.

AI firms must play fair when they use academic data in training


Nature Editorial: “But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright license. In jurisdictions such as the European Union and Japan, there are exemptions to copyright rules that cover factors such as attribution — for example, for text and data mining in research, which uses automated analysis of sources to find patterns. Some scientists see data-scraping for proprietary LLMs as going well beyond what these exemptions were intended to achieve.

In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.

Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms…(More)”.
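The retrieval-augmented approach Wang describes can be illustrated with a toy sketch. Everything here is hypothetical (an invented three-paper corpus and bag-of-words cosine similarity standing in for a real embedding model and LLM); it shows only the mechanism by which retrieved sources get cited alongside an output, not any production system:

```python
import math
from collections import Counter

# Hypothetical corpus of "papers" (titles and texts are invented examples).
PAPERS = {
    "Smith 2021": "transformer models improve protein structure prediction",
    "Lee 2022": "malaria vaccine design using predicted protein structures",
    "Khan 2020": "statistical methods for survey sampling",
}

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the ids of the k papers most similar to the query."""
    q = Counter(query.lower().split())
    scored = [(cosine(q, Counter(text.split())), pid) for pid, text in PAPERS.items()]
    scored.sort(reverse=True)
    return [pid for score, pid in scored[:k] if score > 0]

def answer_with_citations(query: str) -> str:
    """The RAG step in miniature: retrieved sources are attached as citations.

    A real system would pass the retrieved text to the LLM as context;
    here a placeholder answer stands in for the generation step.
    """
    sources = retrieve(query)
    return f"[model answer to: {query}] (sources: {', '.join(sources)})"

print(answer_with_citations("protein structure prediction"))
```

The key design point is that credit attaches at query time to the documents actually retrieved, which is tractable, rather than to the millions of documents that shaped the model's weights, which is not.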

When A.I.’s Output Is a Threat to A.I. Itself


Article by Aatish Bhatia: “The internet is becoming awash in words and images generated by artificial intelligence.

Sam Altman, OpenAI’s chief executive, wrote in February that the company generated about 100 billion words per day — a million novels’ worth of text, every day, an unknown share of which finds its way onto the internet.

A.I.-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a group that tracks online misinformation, recently identified over a thousand websites that churn out error-prone A.I.-generated news articles.

In reality, with no foolproof methods to detect this kind of content, much will simply remain undetected.

All this A.I.-generated information can make it harder for us to know what’s real. And it also poses a problem for A.I. companies. As they trawl the web for new data to train their next models on — an increasingly challenging task — they’re likely to ingest some of their own A.I.-generated content, creating an unintentional feedback loop in which what was once the output from one A.I. becomes the input for another.

In the long run, this cycle may pose a threat to A.I. itself. Research has shown that when generative A.I. is trained on a lot of its own output, it can get a lot worse.

Here’s a simple illustration of what happens when an A.I. system is trained on its own output, over and over again:

This is part of a data set of 60,000 handwritten digits.

When we trained an A.I. to mimic those digits, its output looked like this.

This new set was made by an A.I. trained on the previous A.I.-generated digits. What happens if this process continues?

After 20 generations of training new A.I.s on their predecessors’ output, the digits blur and start to erode.

After 30 generations, they converge into a single shape.

While this is a simplified example, it illustrates a problem on the horizon.

Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction…(More)”.

Policy for responsible use of AI in government


Policy by the Australian Government: “The Policy for the responsible use of AI in government ensures that government plays a leadership role in embracing AI for the benefit of Australians while ensuring its safe, ethical and responsible use, in line with community expectations. The policy:

  • provides a unified approach for government to engage with AI confidently, safely and responsibly, and realise its benefits
  • aims to strengthen public trust in government’s use of AI by providing enhanced transparency, governance and risk assurance
  • aims to embed a forward-leaning, adaptive approach for government’s use of AI that is designed to evolve and develop over time…(More)”.

Relational ethics in health care automation


Paper by Frances Shaw and Anthony McCosker: “Despite the transformative potential of automation and clinical decision support technology in health care, there is growing urgency for more nuanced approaches to ethics. Relational ethics is an approach that can guide the responsible use of a range of automated decision-making systems including the use of generative artificial intelligence and large language models as they affect health care relationships. 

There is an urgent need for sector-wide training and scrutiny regarding the effects of automation using relational ethics touchstones, such as patient-centred health care, informed consent, patient autonomy, shared decision-making, empathy and the politics of care.

The purpose of this review is to offer a provocation for health care practitioners, managers and policy makers to consider the use of automated tools in practice settings and examine how these tools might affect relationships and hence care outcomes…(More)”.

This is AI’s brain on AI


Article by Alison Snyder: “Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots’ knowledge gaps but also destabilize them.

The big picture: As AI models expand in size, their need for data becomes insatiable — but high-quality human-made data is costly, and growing restrictions on the text, images and other kinds of data freely available on the web are driving the technology’s developers toward machine-produced alternatives.

State of play: AI-generated data has been used for years to supplement data in some fields, including medical imaging and computer vision, that use proprietary or private data.

  • But chatbots are trained on public data collected from across the internet that is increasingly being restricted — while at the same time, the web is expected to be flooded with AI-generated content.

Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.

  • Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.
  • Google DeepMind’s new AlphaGeometry 2 system, which can solve math Olympiad problems, is trained from scratch on synthetic data…(More)”

Generative Discrimination: What Happens When Generative AI Exhibits Bias, and What Can Be Done About It


Paper by Philipp Hacker, Frederik Zuiderveen Borgesius, Brent Mittelstadt and Sandra Wachter: “Generative AI (genAI) technologies, while beneficial, risk increasing discrimination by producing demeaning content and subtle biases through inadequate representation of protected groups. This chapter examines these issues, categorizing problematic outputs into three legal categories: discriminatory content; harassment; and legally hard cases like harmful stereotypes. It argues for holding genAI providers and deployers liable for discriminatory outputs and highlights the inadequacy of traditional legal frameworks to address genAI-specific issues. The chapter suggests updating EU laws to mitigate biases in training and input data, mandating testing and auditing, and evolving legislation to enforce standards for bias mitigation and inclusivity as technology advances…(More)”.

A.I. May Save Us, or May Construct Viruses to Kill Us


Article by Nicholas Kristof: “Here’s a bargain of the most horrifying kind: For less than $100,000, it may now be possible to use artificial intelligence to develop a virus that could kill millions of people.

That’s the conclusion of Jason Matheny, the president of the RAND Corporation, a think tank that studies security matters and other issues.

“It wouldn’t cost more to create a pathogen that’s capable of killing hundreds of millions of people versus a pathogen that’s only capable of killing hundreds of thousands of people,” Matheny told me.

In contrast, he noted, it could cost billions of dollars to produce a new vaccine or antiviral in response…

In the early 2000s, some of us worried about smallpox being reintroduced as a bioweapon if the virus were stolen from the labs in Atlanta and in Russia’s Novosibirsk region that have retained the virus since the disease was eradicated. But with synthetic biology, now it wouldn’t have to be stolen.

Some years ago, a research team created a cousin of the smallpox virus, horsepox, in six months for $100,000, and with A.I. it could be easier and cheaper to refine the virus.

One reason biological weapons haven’t been much used is that they can boomerang. If Russia released a virus in Ukraine, it could spread to Russia. But a retired Chinese general has raised the possibility of biological warfare that targets particular races or ethnicities (probably imperfectly), which would make bioweapons much more useful. Alternatively, it might be possible to develop a virus that would kill or incapacitate a particular person, such as a troublesome president or ambassador, if one had obtained that person’s DNA at a dinner or reception.

Assessments of ethnic-targeting research by China are classified, but they may be why the U.S. Defense Department has said that the most important long-term threat of biowarfare comes from China.

A.I. has a more hopeful side as well, of course. It holds the promise of improving education, reducing auto accidents, curing cancers and developing miraculous new pharmaceuticals.

One of the best-known benefits is in protein folding, which can lead to revolutionary advances in medical care. Scientists used to spend years or decades figuring out the shapes of individual proteins, and then a Google initiative called AlphaFold was introduced that could predict the shapes within minutes. “It’s Google Maps for biology,” Kent Walker, president of global affairs at Google, told me.

Scientists have since used updated versions of AlphaFold to work on pharmaceuticals including a vaccine against malaria, one of the greatest killers of humans throughout history.

So it’s unclear whether A.I. will save us or kill us first…(More)”.

The problem of ‘model collapse’: how a lack of human data limits AI progress


Article by Michael Peel: “The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology. 

Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data — information created by AI systems to then also train large language models (LLMs) — as they reach the limits of human-made material that can improve the cutting-edge technology.

Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. 

The work underlines why AI developers have hurried to buy troves of human-generated data for training — and raises questions of what will happen once those finite sources are exhausted. 

“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”

The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.

The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used. 

The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish…(More)”.

When A.I. Fails the Language Test, Who Is Left Out of the Conversation?


Article by Sara Ruberg: “While the use of A.I. has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. A.I. experts worry that the language gap could exacerbate technological inequities, and that it could leave many regions and cultures behind.

A delay in access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a Ph.D. candidate at the Stanford Artificial Intelligence Laboratory who is on the team that built and tested a Vietnamese language model against others.

The tests his team ran found that A.I. tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the A.I. model to learn from.

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because A.I. tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.

An analysis of top websites by W3Techs, a tech survey company, found that English makes up over 60 percent of the internet’s language data. While English is widely spoken globally, native English speakers make up about 5 percent of the population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

Academic institutions, grass-roots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages who aren’t as well represented in the digital landscape.

Lelapa AI, based in Johannesburg, is one such company leading efforts on the African continent. The South African start-up is developing multilingual A.I. products for people and businesses in Africa…(More)”.