‘Not for Machines to Harvest’: Data Revolts Break Out Against A.I.


Article by Sheera Frenkel, and Stuart A. Thompson: “Fan fiction writers are just one group now staging revolts against A.I. systems as a fever over the technology has gripped Silicon Valley and the world. In recent months, social media companies such as Reddit and Twitter, news organizations including The New York Times and NBC News, authors such as Paul Tremblay and the actress Sarah Silverman have all taken a position against A.I. sucking up their data without permission.

Their protests have taken different forms. Writers and artists are locking their files to protect their work or are boycotting certain websites that publish A.I.-generated content, while companies like Reddit want to charge for access to their data. At least 10 lawsuits have been filed this year against A.I. companies, accusing them of training their systems on artists’ creative work without consent. This past week, Ms. Silverman and the authors Christopher Golden and Richard Kadrey sued OpenAI, the maker of ChatGPT, and others over A.I.’s use of their work.

At the heart of the rebellions is a newfound understanding that online information — stories, artwork, news articles, message board posts and photos — may have significant untapped value.

The new wave of A.I. — known as “generative A.I.” for the text, images and other content it generates — is built atop complex systems such as large language models, which are capable of producing humanlike prose. These models are trained on hoards of all kinds of data so they can answer people’s questions, mimic writing styles or churn out comedy and poetry.

That has set off a hunt by tech companies for even more data to feed their A.I. systems. Google, Meta and OpenAI have essentially used information from all over the internet, including large databases of fan fiction, troves of news articles and collections of books, much of which was available free online. In tech industry parlance, this was known as “scraping” the internet…(More)”.

How do we know how smart AI systems are?


Article by Melanie Mitchell: “In 1967, Marvin Minksy, a founder of the field of artificial intelligence (AI), made a bold prediction: “Within a generation…the problem of creating ‘artificial intelligence’ will be substantially solved.” Assuming that a generation is about 30 years, Minsky was clearly overoptimistic. But now, nearly two generations later, how close are we to the original goal of human-level (or greater) intelligence in machines?

Some leading AI researchers would answer that we are quite close. Earlier this year, deep-learning pioneer and Turing Award winner Geoffrey Hinton told Technology Review, “I have suddenly switched my views on whether these things are going to be more intelligent than us. I think they’re very close to it now and they will be much more intelligent than us in the future.” His fellow Turing Award winner Yoshua Bengio voiced a similar opinion in a recent blog post: “The recent advances suggest that even the future where we know how to build superintelligent AIs (smarter than humans across the board) is closer than most people expected just a year ago.”

These are extraordinary claims that, as the saying goes, require extraordinary evidence. However, it turns out that assessing the intelligence—or more concretely, the general capabilities—of AI systems is fraught with pitfalls. Anyone who has interacted with ChatGPT or other large language models knows that these systems can appear quite intelligent. They converse with us in fluent natural language, and in many cases seem to reason, to make analogies, and to grasp the motivations behind our questions. Despite their well-known unhumanlike failings, it’s hard to escape the impression that behind all that confident and articulate language there must be genuine understanding…(More)”.

AI tools are designing entirely new proteins that could transform medicine


Article by Ewen Callaway: “OK. Here we go.” David Juergens, a computational chemist at the University of Washington (UW) in Seattle, is about to design a protein that, in 3-billion-plus years of tinkering, evolution has never produced.

On a video call, Juergens opens a cloud-based version of an artificial intelligence (AI) tool he helped to develop, called RFdiffusion. This neural network, and others like it, are helping to bring the creation of custom proteins — until recently a highly technical and often unsuccessful pursuit — to mainstream science.

These proteins could form the basis for vaccines, therapeutics and biomaterials. “It’s been a completely transformative moment,” says Gevorg Grigoryan, the co-founder and chief technical officer of Generate Biomedicines in Somerville, Massachusetts, a biotechnology company applying protein design to drug development.

The tools are inspired by AI software that synthesizes realistic images, such as the Midjourney software that, this year, was famously used to produce a viral image of Pope Francis wearing a designer white puffer jacket. A similar conceptual approach, researchers have found, can churn out realistic protein shapes to criteria that designers specify — meaning, for instance, that it’s possible to speedily draw up new proteins that should bind tightly to another biomolecule. And early experiments show that when researchers manufacture these proteins, a useful fraction do perform as the software suggests.

The tools have revolutionized the process of designing proteins in the past year, researchers say. “It is an explosion in capabilities,” says Mohammed AlQuraishi, a computational biologist at Columbia University in New York City, whose team has developed one such tool for protein design. “You can now create designs that have sought-after qualities.”

“You’re building a protein structure customized for a problem,” says David Baker, a computational biophysicist at UW whose group, which includes Juergens, developed RFdiffusion. The team released the software in March 2023, and a paper describing the neural network appears this week in Nature1. (A preprint version was released in late 2022, at around the same time that several other teams, including AlQuraishi’s2 and Grigoryan’s3, reported similar neural networks)…(More)”.

Combining Human Expertise with Artificial Intelligence: Experimental Evidence from Radiology


Paper by Nikhil Agarwal, Alex Moehring, Pranav Rajpurkar & Tobias Salz: “While Artificial Intelligence (AI) algorithms have achieved performance levels comparable to human experts on various predictive tasks, human experts can still access valuable contextual information not yet incorporated into AI predictions. Humans assisted by AI predictions could outperform both human-alone or AI-alone. We conduct an experiment with professional radiologists that varies the availability of AI assistance and contextual information to study the effectiveness of human-AI collaboration and to investigate how to optimize it. Our findings reveal that (i) providing AI predictions does not uniformly increase diagnostic quality, and (ii) providing contextual information does increase quality. Radiologists do not fully capitalize on the potential gains from AI assistance because of large deviations from the benchmark Bayesian model with correct belief updating. The observed errors in belief updating can be explained by radiologists’ partially underweighting the AI’s information relative to their own and not accounting for the correlation between their own information and AI predictions. In light of these biases, we design a collaborative system between radiologists and AI. Our results demonstrate that, unless the documented mistakes can be corrected, the optimal solution involves assigning cases either to humans or to AI, but rarely to a human assisted by AI…(More)”.

AI and the automation of work


Essay by Benedict Evans: “…We should start by remembering that we’ve been automating work for 200 years. Every time we go through a wave of automation, whole classes of jobs go away, but new classes of jobs get created. There is frictional pain and dislocation in that process, and sometimes the new jobs go to different people in different places, but over time the total number of jobs doesn’t go down, and we have all become more prosperous.

When this is happening to your own generation, it seems natural and intuitive to worry that this time, there aren’t going to be those new jobs. We can see the jobs that are going away, but we can’t predict what the new jobs will be, and often they don’t exist yet. We know (or should know), empirically, that there always have been those new jobs in the past, and that they weren’t predictable either: no-one in 1800 would have predicted that in 1900 a million Americans would work on ‘railways’ and no-one in 1900 would have predicted ‘video post-production’ or ‘software engineer’ as employment categories. But it seems insufficient to take it on faith that this will happen now just because it always has in the past. How do you know it will happen this time? Is this different?

At this point, any first-year economics student will tell us that this is answered by, amongst other things, the ‘Lump of Labour’ fallacy.

The Lump of Labour fallacy is the misconception that there is a fixed amount of work to be done, and that if some work is taken by a machine then there will be less work for people. But if it becomes cheaper to use a machine to make, say, a pair of shoes, then the shoes are cheaper, more people can buy shoes and they have more money to spend on other things besides, and we discover new things we need or want, and new jobs. The efficient gain isn’t confined to the shoe: generally, it ripples outward through the economy and creates new prosperity and new jobs. So, we don’t know what the new jobs will be, but we have a model that says, not just that there always have been new jobs, but why that is inherent in the process. Don’t worry about AI!The most fundamental challenge to this model today, I think, is to say that no, what’s really been happening for the last 200 years of automation is that we’ve been moving up the scale of human capability…(More)”.

Engaging Scientists to Prevent Harmful Exploitation of Advanced Data Analytics and Biological Data


Proceedings from the National Academies of Sciences: “Artificial intelligence (AI), facial recognition, and other advanced computational and statistical techniques are accelerating advancements in the life sciences and many other fields. However, these technologies and the scientific developments they enable also hold the potential for unintended harm and malicious exploitation. To examine these issues and to discuss practices for anticipating and preventing the misuse of advanced data analytics and biological data in a global context, the National Academies of Sciences, Engineering, and Medicine convened two virtual workshops on November 15, 2022, and February 9, 2023. The workshops engaged scientists from the United States, South Asia, and Southeast Asia through a series of presentations and scenario-based exercises to explore emerging applications and areas of research, their potential benefits, and the ethical issues and security risks that arise when AI applications are used in conjunction with biological data. This publication highlights the presentations and discussions of the workshops…(More)”.

How should a robot explore the Moon? A simple question shows the limits of current AI systems


Article by Sally Cripps, Edward Santow, Nicholas Davis, Alex Fischer and Hadi Mohasel Afshar: “..Ultimately, AI systems should help humans make better, more accurate decisions. Yet even the most impressive and flexible of today’s AI tools – such as the large language models behind the likes of ChatGPT – can have the opposite effect.

Why? They have two crucial weaknesses. They do not help decision-makers understand causation or uncertainty. And they create incentives to collect huge amounts of data and may encourage a lax attitude to privacy, legal and ethical questions and risks…

ChatGPT and other “foundation models” use an approach called deep learning to trawl through enormous datasets and identify associations between factors contained in that data, such as the patterns of language or links between images and descriptions. Consequently, they are great at interpolating – that is, predicting or filling in the gaps between known values.

Interpolation is not the same as creation. It does not generate knowledge, nor the insights necessary for decision-makers operating in complex environments.

However, these approaches require huge amounts of data. As a result, they encourage organisations to assemble enormous repositories of data – or trawl through existing datasets collected for other purposes. Dealing with “big data” brings considerable risks around security, privacy, legality and ethics.

In low-stakes situations, predictions based on “what the data suggest will happen” can be incredibly useful. But when the stakes are higher, there are two more questions we need to answer.

The first is about how the world works: “what is driving this outcome?” The second is about our knowledge of the world: “how confident are we about this?”…(More)”.

How to Regulate AI? Start With the Data


Article by Susan Ariel Aaronson: “We live in an era of data dichotomy. On one hand, AI developers rely on large data sets to “train” their systems about the world and respond to user questions. These data troves have become increasingly valuable and visible. On the other hand, despite the import of data, U.S. policy makers don’t view data governance as a vehicle to regulate AI.  

U.S. policy makers should reconsider that perspective. As an example, the European Union, and more than 30 other countries, provide their citizens with a right not to be subject to automated decision making without explicit consent. Data governance is clearly an effective way to regulate AI.

Many AI developers treat data as an afterthought, but how AI firms collect and use data can tell you a lot about the quality of the AI services they produce. Firms and researchers struggle to collect, classify, and label data sets that are large enough to reflect the real world, but then don’t adequately clean (remove anomalies or problematic data) and check their data. Also, few AI developers and deployers divulge information about the data they use to train AI systems. As a result, we don’t know if the data that underlies many prominent AI systems is complete, consistent, or accurate. We also don’t know where that data comes from (its provenance). Without such information, users don’t know if they should trust the results they obtain from AI. 

The Washington Post set out to document this problem. It collaborated with the Allen Institute for AI to examine Google’s C4 data set, a widely used and large learning model built on data scraped by bots from 15 million websites. Google then filters the data, but it understandably can’t filter the entire data set.  

Hence, this data set provides sufficient training data, but it also presents major risks for those firms or researchers who rely on it. Web scraping is generally legal in most countries as long as the scraped data isn’t used to cause harm to society, a firm, or an individual. But the Post found that the data set contained swaths of data from sites that sell pirated or counterfeit data, which the Federal Trade Commission views as harmful. Moreover, to be legal, the scraped data should not include personal data obtained without user consent or proprietary data obtained without firm permission. Yet the Post found large amounts of personal data in the data sets as well as some 200 million instances of copyrighted data denoted with the copyright symbol.

Reliance on scraped data sets presents other risks. Without careful examination of the data sets, the firms relying on that data and their clients cannot know if it contains incomplete or inaccurate data, which in turn could lead to problems of bias, propaganda, and misinformation. But researchers cannot check data accuracy without information about data provenance. Consequently, the firms that rely on such unverified data are creating some of the AI risks regulators hope to avoid. 

It makes sense for Congress to start with data as it seeks to govern AI. There are several steps Congress could take…(More)”.

How to Stay Smart in a Smart World


Book by Gerd Gigerenzer: “From dating apps and self-driving cars to facial recognition and the justice system, the increasing presence of AI has been widely championed – but there are limitations and risks too. In this book Gigerenzer shows how humans are often the greatest source of uncertainty and when people are involved, unwavering trust in complex algorithms can become a recipe for disaster. We need, now more than ever, to arm ourselves with knowledge that will help us make better decisions in a digital age.

Filled with practical examples and cutting-edge research, How to Stay Smart in a Smart World examines the growing role of AI at all levels of daily life with refreshing clarity. This book is a life raft in a sea of information and an urgent invitation to actively shape the world in which we want to live…(More)”.