New AI standards group wants to make data scraping opt-in


Article by Kate Knibbs: “The first wave of major generative AI tools were largely trained on “publicly available” data—basically, anything and everything that could be scraped from the Internet. Now, sources of training data are increasingly restricting access and pushing for licensing agreements. With the hunt for additional data sources intensifying, new licensing startups have emerged to keep the source material flowing.

The Dataset Providers Alliance, a trade group formed this summer, wants to make the AI industry more standardized and fair. To that end, it has just released a position paper outlining its stances on major AI-related issues. The alliance is made up of seven AI licensing companies, including music copyright-management firm Rightsify, Japanese stock-photo marketplace Pixta, and generative-AI copyright-licensing startup Calliope Networks. (At least five new members will be announced in the fall.)

The DPA advocates for an opt-in system, meaning that data can be used only after consent is explicitly given by creators and rights holders. This represents a significant departure from the way most major AI companies operate. Some have developed their own opt-out systems, which put the burden on data owners to pull their work on a case-by-case basis. Others offer no opt-outs whatsoever…(More)”.

Policies must be justified by their wellbeing-to-cost ratio


Article by Richard Layard: “…What is its value for money — that is, how much wellbeing does it deliver per (net) pound it costs the government? This benefit/cost ratio (or BCR) should be central to every discussion.

The science exists to produce these numbers and, if the British government were to require them of the spending departments, it would be setting an example of rational government to the whole world.

Such a move would, of course, lead to major changes in priorities. At the London School of Economics we have been calculating the benefits and costs of policies across a whole range of government departments.

In our latest report on value for money, the best policies are those that save the government more money than they cost — for example by getting people back to work. Classic examples of this are treatments for mental health. The NHS Talking Therapies programme now treats 750,000 people a year for anxiety disorders and depression. Half of them recover and the service demonstrably pays for itself. It needs to expand.

But we also need a parallel service for those addicted to alcohol, drugs and gambling. These individuals are more difficult to treat — but the savings if they recover are greater. Again, it will pay for itself. And so will the improved therapy service for children and young people that Labour has promised.

However, most spending policies do cost more than they save. For these it is crucial to measure the benefit/cost ratio, converting the wellbeing benefit into its monetary equivalent. For example, we can evaluate the wellbeing gain to a community of having more police and, as a result, less crime. Once this is converted into money, we calculate that the benefit/cost ratio is 12:1 — very high…(More)”.
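
The arithmetic behind such a ratio is simple to sketch. The figures below are illustrative placeholders, not numbers from the LSE report: monetise the wellbeing gain (here in WELLBYs, wellbeing-adjusted life years), divide by the net cost to government, and treat policies that save more than they cost as having unbounded value for money.

```python
def benefit_cost_ratio(wellbeing_gain_wellbys, value_per_wellby, gross_cost, savings):
    """Monetised wellbeing benefit divided by net cost to government."""
    net_cost = gross_cost - savings
    if net_cost <= 0:
        # The policy pays for itself (e.g. therapies that return people to work).
        return float("inf")
    return (wellbeing_gain_wellbys * value_per_wellby) / net_cost

# Hypothetical policing programme: 1,200 WELLBYs gained, valued at
# £13,000 each, against a £1.5m net cost.
print(f"BCR {benefit_cost_ratio(1_200, 13_000, 1_500_000, 0):.1f}:1")
```

Under this framing, ranking departmental proposals by BCR is what would let a government direct each marginal pound to the policy delivering the most wellbeing.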

AI has a democracy problem. Citizens’ assemblies can help.


Article by Jack Stilgoe: “…With AI, beneath all the hype, some companies know that they have a democracy problem. OpenAI admitted as much when they funded a program of pilot projects for what they called “Democratic Inputs to AI.” There have been some interesting efforts to involve the public in rethinking cutting-edge AI. A collaboration between Anthropic, one of OpenAI’s competitors, and the Collective Intelligence Project asked 1,000 Americans to help shape what they called “Collective Constitutional AI.” They were asked to vote on statements such as “the AI should not be toxic” and “AI should be interesting,” and they were given the option of adding their own statements (one of the stranger statements reads “AI should not spread Marxist communistic ideology”). Anthropic used these inputs to tweak its “Claude” large language model; when tested against standard AI benchmarks, the tweaks seemed to mitigate the model’s biases.
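
As a rough illustration of that voting mechanism (the statements are from the excerpt above, but the tallies and the 60% threshold are invented, and Anthropic’s actual aggregation, built on the Polis platform, is more sophisticated than a simple approval count), public votes on candidate principles might be distilled into a ranked constitution like this:

```python
# Hypothetical vote tallies on candidate principles for a "constitution".
votes = {
    "The AI should not be toxic":        {"agree": 920, "disagree": 80},
    "AI should be interesting":          {"agree": 610, "disagree": 390},
    "AI should not give medical advice": {"agree": 540, "disagree": 460},
}

def approval(tally):
    """Share of voters agreeing with a statement."""
    return tally["agree"] / (tally["agree"] + tally["disagree"])

# Keep only statements with broad support, ranked by approval rate.
constitution = [s for s, t in sorted(votes.items(),
                                     key=lambda kv: approval(kv[1]),
                                     reverse=True)
                if approval(t) >= 0.6]
print(constitution)
```

Even this toy version surfaces the political question the article raises: someone must choose the threshold and the aggregation rule, and those choices shape whose values make it into the model.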

In using the word “constitutional,” Anthropic admits that, in making AI systems, they are doing politics by other means. We should welcome the attempt to open up. But, ultimately, these companies are interested in questions of design, not regulation. They would like there to be a societal consensus, a set of human values to which they can “align” their systems. Politics is rarely that neat…(More)”.

Breaking the Wall of Digital Heteronomy


Interview with Julia Janssen: “The walls of algorithms increasingly shape your life. Telling you what to buy, where to go, what news to believe or songs to listen to. Data helps to navigate the world’s complexity and its endless possibilities. Artificial intelligence promises frictionless experiences, tailored and targeted, seamless and optimized to serve you best. But at what cost? Frictionlessness comes with obedience. To the machine, the market and your own prophecy.

Mapping the Oblivion researches the influence of data and AI on human autonomy. The installation visualized Netflix’s percentage-based prediction models to provoke questions about the extent to which we want to quantify choices. Will you only watch movies that are over 64% to your liking? Dine at restaurants that match your appetite above 76%? Date people with a compatibility rate of 89%? Will you never choose the career you want when there is only a 12% chance you’ll succeed? Do you want to outsmart your intuition with systems you do not understand and follow the map of probabilities and statistics?

Digital heteronomy is a condition in which one is guided by data, governed by AI and ordained by the industry. Homo sapiens, the knowing being, becomes Homo stultus, the controllable being.

Living a quantified life in a numeric world. Not having to choose, doubt or wonder. Kept safe, risk-free and predictable within algorithmic walls. Exhausted of autonomy, creativity and randomness. Imprisoned in bubbles, profiles and behavioural tribes. Controllable, observable and monetizable.

Breaking the wall of digital heteronomy means taking back control over our data, identity, choices and chances in life. Honouring the unexpected, risk, doubt and having an unknown future. Shattering the power structures created by Big Tech to harvest information and capitalize on unfairness, vulnerabilities and fears. Breaking the wall of digital heteronomy means breaking down a system where profit is more important than people…(More)”.

AI firms must play fair when they use academic data in training


Nature Editorial: “But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright license. In jurisdictions such as the European Union and Japan, there are exemptions to copyright rules that cover factors such as attribution — for example, for text and data mining in research, which uses automated analysis of sources to find patterns. Some scientists see data-scraping for proprietary LLMs as going well beyond what these exemptions were intended to achieve.

In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.
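
A minimal sketch of that idea follows; the corpus, paper IDs, scoring, and the stand-in generate function are all invented for illustration, and a real system would use dense embeddings rather than keyword overlap. The point is the shape of retrieval-augmented generation: retrieve the most relevant papers, hand them to the model as context, and return their IDs so the output can cite its sources.

```python
def retrieve(query, corpus, k=2):
    """Rank documents by naive keyword overlap with the query."""
    terms = set(query.lower().split())
    scores = {doc_id: len(terms & set(text.lower().split()))
              for doc_id, text in corpus.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]

def answer_with_citations(query, corpus, generate):
    """Ground the model's answer in retrieved papers and return their
    IDs so credit can be apportioned explicitly."""
    doc_ids = retrieve(query, corpus)
    context = " ".join(corpus[d] for d in doc_ids)
    return generate(query, context), doc_ids

# Toy corpus; in practice the retriever would search a literature index.
papers = {
    "Wang2023": "survey of large language models in science",
    "Doe2022": "protein folding with deep learning",
}
reply, cited = answer_with_citations(
    "language models for science", papers,
    generate=lambda q, ctx: "(model answer grounded in: " + ctx + ")")
print(cited)
```

As the excerpt notes, this credits the papers surfaced at answer time, not the millions of sources used in pretraining, which is why it eases rather than solves the attribution problem.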

Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms…(More)”.

When A.I.’s Output Is a Threat to A.I. Itself


Article by Aatish Bhatia: “The internet is becoming awash in words and images generated by artificial intelligence.

Sam Altman, OpenAI’s chief executive, wrote in February that the company generated about 100 billion words per day — a million novels’ worth of text, every day, an unknown share of which finds its way onto the internet.

A.I.-generated text may show up as a restaurant review, a dating profile or a social media post. And it may show up as a news article, too: NewsGuard, a group that tracks online misinformation, recently identified over a thousand websites that churn out error-prone A.I.-generated news articles.

In reality, with no foolproof methods to detect this kind of content, much will simply remain undetected.

All this A.I.-generated information can make it harder for us to know what’s real. And it also poses a problem for A.I. companies. As they trawl the web for new data to train their next models on — an increasingly challenging task — they’re likely to ingest some of their own A.I.-generated content, creating an unintentional feedback loop in which what was once the output from one A.I. becomes the input for another.

In the long run, this cycle may pose a threat to A.I. itself. Research has shown that when generative A.I. is trained on a lot of its own output, it can get a lot worse.

Here’s a simple illustration of what happens when an A.I. system is trained on its own output, over and over again:

In the original article, an interactive graphic walks through the process using a data set of 60,000 handwritten digits: an A.I. trained to mimic those digits produces a recognizable new set; a second A.I. trained on that A.I.-generated output produces another; after 20 generations of training new A.I.s on their predecessors’ output, the digits blur and start to erode, and after 30 generations they converge into a single shape.

While this is a simplified example, it illustrates a problem on the horizon.
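
That degradation can be reproduced in miniature. The sketch below is a hedged toy, not the article’s experiment: each “generation” fits a one-dimensional Gaussian “model” to samples drawn from the previous generation’s model, then the next generation trains on that purely synthetic output. Over many generations the spread collapses, the analogue of the digits converging on a single shape.

```python
import random
import statistics

random.seed(0)

mu, sigma = 0.0, 1.0              # generation 0: the real data distribution
n_samples, n_generations = 20, 500
stds = [sigma]

for _ in range(n_generations):
    data = [random.gauss(mu, sigma) for _ in range(n_samples)]
    mu = statistics.fmean(data)            # refit on synthetic data only
    sigma = statistics.pstdev(data, mu)
    stds.append(sigma)

# Diversity (the spread of the distribution) drifts toward zero.
print(f"std after 0 generations: {stds[0]:.3f}")
print(f"std after {n_generations} generations: {stds[-1]:.6f}")
```

Each refit slightly underestimates or overestimates the spread, and because a collapsed distribution never widens again, the errors compound downward, which is the feedback loop the article warns about.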

Imagine a medical-advice chatbot that lists fewer diseases that match your symptoms, because it was trained on a narrower spectrum of medical knowledge generated by previous chatbots. Or an A.I. history tutor that ingests A.I.-generated propaganda and can no longer separate fact from fiction…(More)”.

This is AI’s brain on AI


Article by Alison Snyder: “Data to train AI models increasingly comes from other AI models in the form of synthetic data, which can fill in chatbots’ knowledge gaps but also destabilize them.

The big picture: As AI models expand in size, their need for data becomes insatiable — but high quality human-made data is costly, and growing restrictions on the text, images and other kinds of data freely available on the web are driving the technology’s developers toward machine-produced alternatives.

State of play: AI-generated data has been used for years to supplement data in fields such as medical imaging and computer vision that rely on proprietary or private data.

  • But chatbots are trained on public data collected from across the internet that is increasingly being restricted — while at the same time, the web is expected to be flooded with AI-generated content.

Those constraints and the decreasing cost of generating synthetic data are spurring companies to use AI-generated data to help train their models.

  • Meta, Google, Anthropic and others are using synthetic data — alongside human-generated data — to help train the AI models that power their chatbots.
  • Google DeepMind’s new AlphaGeometry 2 system, which can solve math Olympiad problems, was trained from scratch on synthetic data…(More)”

A.I. May Save Us, or May Construct Viruses to Kill Us


Article by Nicholas Kristof: “Here’s a bargain of the most horrifying kind: For less than $100,000, it may now be possible to use artificial intelligence to develop a virus that could kill millions of people.

That’s the conclusion of Jason Matheny, the president of the RAND Corporation, a think tank that studies security matters and other issues.

“It wouldn’t cost more to create a pathogen that’s capable of killing hundreds of millions of people versus a pathogen that’s only capable of killing hundreds of thousands of people,” Matheny told me.

In contrast, he noted, it could cost billions of dollars to produce a new vaccine or antiviral in response…

In the early 2000s, some of us worried about smallpox being reintroduced as a bioweapon if the virus were stolen from the labs in Atlanta and in Russia’s Novosibirsk region that have retained it since the disease was eradicated. But with synthetic biology, it now wouldn’t have to be stolen.

Some years ago, a research team created a cousin of the smallpox virus, horse pox, in six months for $100,000, and with A.I. it could be easier and cheaper to refine the virus.

One reason biological weapons haven’t been much used is that they can boomerang. If Russia released a virus in Ukraine, it could spread to Russia. But a retired Chinese general has raised the possibility of biological warfare that targets particular races or ethnicities (probably imperfectly), which would make bioweapons much more useful. Alternatively, it might be possible to develop a virus that would kill or incapacitate a particular person, such as a troublesome president or ambassador, if one had obtained that person’s DNA at a dinner or reception.

Assessments of ethnic-targeting research by China are classified, but they may be why the U.S. Defense Department has said that the most important long-term threat of biowarfare comes from China.

A.I. has a more hopeful side as well, of course. It holds the promise of improving education, reducing auto accidents, curing cancers and developing miraculous new pharmaceuticals.

One of the best-known benefits is in protein folding, which can lead to revolutionary advances in medical care. Scientists used to spend years or decades figuring out the shapes of individual proteins, and then a Google initiative called AlphaFold was introduced that could predict the shapes within minutes. “It’s Google Maps for biology,” Kent Walker, president of global affairs at Google, told me.

Scientists have since used updated versions of AlphaFold to work on pharmaceuticals including a vaccine against malaria, one of the greatest killers of humans throughout history.

So it’s unclear whether A.I. will save us or kill us first…(More)”.

Supporting Scientific Citizens


Article by Lisa Margonelli: “What do nuclear fusion power plants, artificial intelligence, hydrogen infrastructure, and drinking water recycled from human waste have in common? Aside from being featured in this edition of Issues, they all require intense public engagement to choose among technological tradeoffs, safety profiles, and economic configurations. Reaching these understandings requires researchers, engineers, and decisionmakers who are adept at working with the public. It also requires citizens who want to engage with such questions and can articulate what they want from science and technology.

This issue offers a glimpse into what these future collaborations might look like. To train engineers with the “deep appreciation of the social, cultural, and ethical priorities and implications of the technological solutions engineers are tasked with designing and deploying,” University of Michigan nuclear engineer Aditi Verma and coauthors Katie Snyder and Shanna Daly asked their first-year engineering students to codesign nuclear power plants in collaboration with local community members. Although traditional nuclear engineering classes avoid “getting messy,” Verma and colleagues wanted students to engage honestly with the uncertainties of the profession. In the process of working with communities, the students’ vocabulary changed; they spoke of trust, respect, and “love” for community—even when considering deep geological waste repositories…(More)”.

Governments Empower Citizens by Promoting Digital Rights


Article by Julia Edinger: “The rapid rise of digital services and smart city technology has elevated concerns about privacy in the digital age and government’s role, even as cities from California to Texas take steps to make constituents aware of their digital rights.

Earlier this month, Long Beach, Calif., launched an improved version of its Digital Rights Platform, which shows constituents their data privacy and digital rights, along with information about how the city uses technologies while protecting those rights.

“People’s digital rights are no different from their human or civil rights, except that they’re applied to how they interact with digital technologies — when you’re online, you’re still entitled to every right you enjoy offline,” said Will Greenberg, staff technologist at the Electronic Frontier Foundation (EFF), in a written statement. The nonprofit organization defends civil liberties in the digital world.


Long Beach’s platform initially launched several years ago to mitigate privacy concerns that came out of the 2020 launch of a smart city initiative, according to Long Beach CIO Lea Eriksen. When that initiative debuted, the Department of Innovation and Technology requested the City Council approve a set of data privacy guidelines to ensure digital rights would be protected, setting the stage for the initial platform launch. Its 2021 beta version has now been enhanced to offer information on 22 city technology uses, up from two, and an improved feedback module enabling continued engagement and platform improvements…(More)”.