The problem of ‘model collapse’: how a lack of human data limits AI progress


Article by Michael Peel: “The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology. 

Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data — information created by AI systems that is then used to train large language models (LLMs) — as they reach the limits of human-made material that can improve the cutting-edge technology.

Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. 

The work underlines why AI developers have hurried to buy troves of human-generated data for training — and raises questions about what will happen once those finite sources are exhausted. 

“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”

The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.

The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used. 

The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish…(More)”.
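
To make the “loss of variance” concrete, here is a minimal toy simulation (our sketch, not the Nature paper’s LLM experiments): fit a simple Gaussian model to data, replace the data with samples drawn from the fitted model, and repeat. Estimation error compounds across generations, and the spread of the data tends to drift downward, squeezing out the tails of the original distribution first.

```python
import numpy as np

# Toy version of recursive training: fit a Gaussian to the data, replace
# the data with samples from the fitted model, and repeat. Estimation
# error compounds across "generations", and the variance tends to shrink,
# so rare (tail) behaviour disappears first.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.0, scale=1.0, size=500)  # generation 0: "human" data

for generation in range(1, 11):
    mu, sigma = data.mean(), data.std()      # fit this generation's model
    data = rng.normal(mu, sigma, size=500)   # next generation trains on synthetic output
    print(f"gen {generation}: mean = {mu:+.3f}, std = {sigma:.3f}")
```

The same dynamic, at vastly greater scale and complexity, is what the paper describes for language models trained on their predecessors’ synthetic output.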

Illuminating ‘the ugly side of science’: fresh incentives for reporting negative results


Article by Rachel Brazil: “Editor-in-chief Sarahanne Field describes herself and her team at the Journal of Trial & Error as wanting to highlight the “ugly side of science — the parts of the process that have gone wrong”.

She clarifies that the editorial board of the journal, which launched in 2020, isn’t interested in papers in which “you did a shitty study and you found nothing. We’re interested in stuff that was done methodologically soundly, but still yielded a result that was unexpected.” These types of result — ones that fail to support a hypothesis or that yield unexplained outcomes — often simply go unpublished, explains Field, who is also an open-science researcher at the University of Groningen in the Netherlands. Along with Stefan Gaillard, one of the journal’s founders, she hopes to change that.

Calls for researchers to publish failed studies are not new. The ‘file-drawer problem’ — the stacks of unpublished, negative results that most researchers accumulate — was first described in 1979 by psychologist Robert Rosenthal. He argued that this leads to publication bias in the scientific record: the gap of missing unsuccessful results leads to overemphasis on the positive results that do get published…(More)”.
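
Rosenthal’s point is easy to demonstrate with a toy simulation (an illustration, not from the article): if only statistically significant results get published, the published literature will systematically overstate the true effect.

```python
import numpy as np
from scipy import stats

# Simulate the file-drawer problem: many small studies of a modest true
# effect, with only the statistically significant ones "published".
rng = np.random.default_rng(1)
true_effect, n_per_study = 0.2, 30

published_effects = []
for _ in range(2000):
    sample = rng.normal(true_effect, 1.0, size=n_per_study)
    t_stat, p_value = stats.ttest_1samp(sample, 0.0)
    if p_value < 0.05 and t_stat > 0:  # the file drawer keeps the rest
        published_effects.append(sample.mean())

print(f"true effect: {true_effect}")
print(f"average published effect: {np.mean(published_effects):.2f}")  # inflated
```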

The Risks of Empowering “Citizen Data Scientists”


Article by Reid Blackman and Tamara Sipes: “Until recently, the prevailing understanding of artificial intelligence (AI) and its subset machine learning (ML) was that expert data scientists and AI engineers were the only people who could push AI strategy and implementation forward. That was a reasonable view. After all, data science generally, and AI in particular, is a technical field requiring, among other things, expertise that takes many years of education and training to acquire.

Fast forward to today, however, and the conventional wisdom is rapidly changing. The advent of “auto-ML” — software that provides methods and processes for creating machine learning code — has led to calls to “democratize” data science and AI. The idea is that these tools enable organizations to invite and leverage non-data scientists — say, domain data experts, team members very familiar with the business processes, or heads of various business units — to propel their AI efforts.
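
For readers unfamiliar with such tools, here is a rough sketch of the kind of model search that auto-ML software automates (an illustration using scikit-learn, not any particular product): try several candidate models and keep the one with the best cross-validated score. Real auto-ML systems also search over feature preprocessing and hyperparameters.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Evaluate several candidate models with cross-validation and keep the
# best; dataset and candidate models here are illustrative.
X, y = load_breast_cancer(return_X_y=True)
candidates = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
best_name, best_score = None, -1.0
for name, model in candidates.items():
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {score:.3f}")
    if score > best_score:
        best_name, best_score = name, score
print(f"selected model: {best_name}")
```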

In theory, making data science and AI more accessible to non-data scientists (including technologists who are not data scientists) can make a lot of business sense. Centralized and siloed data science units can fail to appreciate the vast array of data the organization has and the business problems that it can solve, particularly in multinational organizations with hundreds or thousands of business units distributed across several continents. Moreover, those in the weeds of business units know the data they have and the problems they’re trying to solve, and can, with training, see how that data can be leveraged to solve those problems. The opportunities are significant.

In short, with great business insight, augmented with auto-ML, can come great analytic responsibility. At the same time, we cannot forget that data science and AI are, in fact, very difficult, and there’s a very long journey from having data to solving a problem. In this article, we’ll lay out the pros and cons of integrating citizen data scientists into your AI strategy and suggest methods for optimizing success and minimizing risks…(More)”.

The Department of Everything


Article by Stephen Akey: “How do you find the life expectancy of a California condor? Google it. Or the gross national product of Morocco? Google it. Or the final resting place of Tom Paine? Google it. There was a time, however—not all that long ago—when you couldn’t Google it or ask Siri or whatever cyber equivalent comes next. You had to do it the hard way—by consulting reference books, indexes, catalogs, almanacs, statistical abstracts, and myriad other printed sources. Or you could save yourself all that time and trouble by taking the easiest available shortcut: You could call me.

From 1984 to 1988, I worked in the Telephone Reference Division of the Brooklyn Public Library. My seven or eight colleagues and I spent the days (and nights) answering exactly such questions. Our callers were as various as New York City itself: copyeditors, fact checkers, game show aspirants, journalists, bill collectors, bet settlers, police detectives, students and teachers, the idly curious, the lonely and loquacious, the park bench crazies, the nervously apprehensive. (This last category comprised many anxious patients about to undergo surgery who called us for background checks on their doctors.) There were telephone reference divisions in libraries all over the country, but this being New York City, we were an unusually large one with an unusually heavy volume of calls. And if I may say so, we were one of the best. More than one caller told me that we were a legend in the world of New York magazine publishing…(More)”.

10 profound answers about the math behind AI


Article by Ethan Siegel: “Why do machines learn? Even in the recent past, this would have been a ridiculous question, as machines — i.e., computers — were only capable of executing whatever instructions a human programmer had programmed into them. With the rise of generative artificial intelligence (AI), however, machines truly appear to be gifted with the ability to learn, refining their answers based on continued interactions with both human and non-human users. Large language model-based artificial intelligence programs, such as ChatGPT, Claude, Gemini and more, are now so widespread that they’re replacing traditional tools, including Google searches, in applications all across the world.
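
To make “the ability to learn” concrete: mathematically, learning means adjusting a model’s parameters to reduce a loss function. Here is a minimal sketch (ours, not the book’s) that fits y = w * x by gradient descent on squared error.

```python
# A toy model that "learns": fit y = w * x to data by gradient descent on
# squared error. Modern neural networks apply the same principle with
# billions of parameters. (Illustrative sketch only.)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.1]  # roughly y = 2x

w, learning_rate = 0.0, 0.01
for step in range(200):
    # gradient of mean squared error with respect to w
    grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= learning_rate * grad  # nudge w downhill on the loss
print(f"learned w = {w:.3f}")  # converges near 2
```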

How did this come to be? How did we so swiftly come to live in an era where many of us are happy to turn over aspects of our lives that traditionally needed a human expert to a computer program? From financial to medical decisions, from quantum systems to protein folding, and from sorting data to finding signals in a sea of noise, many programs that leverage artificial intelligence (AI) and machine learning (ML) are far superior at these tasks compared with even the greatest human experts.

In his new book, Why Machines Learn: The Elegant Math Behind Modern AI, science writer Anil Ananthaswamy explores all of these aspects and more. I was fortunate enough to get to do a question-and-answer interview with him, and here are the 10 most profound responses he was generous enough to give….(More)”

The MAGA Plan to End Free Weather Reports


Article by Zoë Schlanger: “In the United States, as in most other countries, weather forecasts are a freely accessible government amenity. The National Weather Service issues alerts and predictions, warning of hurricanes and excessive heat and rainfall, all at the total cost to American taxpayers of roughly $4 per person per year. Anyone with a TV, smartphone, radio, or newspaper can know what tomorrow’s weather will look like, whether a hurricane is heading toward their town, or if a drought has been forecast for the next season. Even if they get that news from a privately owned app or TV station, much of the underlying weather data are courtesy of meteorologists working for the federal government.

Charging for popular services that were previously free isn’t generally a winning political strategy. But hard-right policy makers appear poised to try to do just that should Republicans gain power in the next term. Project 2025—a nearly 900-page book of policy proposals published by the conservative think tank the Heritage Foundation—states that an incoming administration should all but dissolve the National Oceanic and Atmospheric Administration, under which the National Weather Service operates….NOAA “should be dismantled and many of its functions eliminated, sent to other agencies, privatized, or placed under the control of states and territories,” Project 2025 reads. … “The preponderance of its climate-change research should be disbanded,” the document says. It further notes that scientific agencies such as NOAA are “vulnerable to obstructionism of an Administration’s aims,” so appointees should be screened to ensure that their views are “wholly in sync” with the president’s…(More)”.

Gen AI: too much spend, too little benefit?


Article by Jason Koebler: “Investment giant Goldman Sachs published a research paper about the economic viability of generative AI, which notes that there is “little to show for” the huge amount of spending on generative AI infrastructure and questions “whether this large spend will ever pay off in terms of AI benefits and returns.” 

The paper, called “Gen AI: too much spend, too little benefit?” is based on a series of interviews with Goldman Sachs economists and researchers, MIT professor Daron Acemoglu, and infrastructure experts. The paper ultimately questions whether generative AI will ever become the transformative technology that Silicon Valley and large portions of the stock market are currently betting on, but says investors may continue to get rich anyway. “Despite these concerns and constraints, we still see room for the AI theme to run, either because AI starts to deliver on its promise, or because bubbles take a long time to burst,” the paper notes. 

Goldman Sachs researchers also note that AI optimism is driving large growth in stocks like Nvidia and other S&P 500 companies (the largest companies in the stock market), but argue that the stock price gains we’ve seen are based on the assumption that generative AI is going to lead to higher productivity (which necessarily means automation, layoffs, lower labor costs, and higher efficiency). These stock gains are already baked in, Goldman Sachs argues in the paper: “Although the productivity pick-up that AI promises could benefit equities via higher profit growth, we find that stocks often anticipate higher productivity growth before it materializes, raising the risk of overpaying. And using our new long-term return forecasting framework, we find that a very favorable AI scenario may be required for the S&P 500 to deliver above-average returns in the coming decade.”…(More)

Doing science backwards


Article by Stuart Ritchie: “…Usually, the process of publishing such a study would look like this: you run the study; you write it up as a paper; you submit it to a journal; the journal gets some other scientists to peer-review it; it gets published – or if it doesn’t, you either discard it, or send it off to a different journal and the whole process starts again.

That’s standard operating procedure. But it shouldn’t be. Think about the job of the peer-reviewer: when they start their work, they’re handed a full-fledged paper, reporting on a study and a statistical analysis that happened at some point in the past. It’s all now done and, if not fully dusted, then in a pretty final-looking form.

What can the reviewer do? They can check the analysis makes sense, sure; they can recommend new analyses are done; they can even, in extreme cases, make the original authors go off and collect some entirely new data in a further study – maybe the data the authors originally presented just aren’t convincing or don’t represent a proper test of the hypothesis.

Ronald Fisher described the study-first, review-later process in 1938:

To consult the statistician [or, in our case, peer-reviewer] after an experiment is finished is often merely to ask him to conduct a post mortem examination. He can perhaps say what the experiment died of.

Clearly this isn’t the optimal, most efficient way to do science. Why don’t we review the statistics and design of a study right at the beginning of the process, rather than at the end?

This is where Registered Reports come in. They’re a new (well, new-ish) way of publishing papers where, before you go to the lab, or wherever you’re collecting data, you write down your plan for your study and send it off for peer-review. The reviewers can then give you genuinely constructive criticism – you can literally construct your experiment differently depending on their suggestions. You build consensus—between you, the reviewers, and the journal editor—on the method of the study. And then, once everyone agrees on what a good study of this question would look like, you go off and do it. The key part is that, at this point, the journal agrees to publish your study, regardless of what the results might eventually look like…(More)”.

Enhancing human mobility research with open and standardized datasets


Article by Takahiro Yabe et al: “The proliferation of large-scale, passively collected location data from mobile devices has enabled researchers to gain valuable insights into various societal phenomena. In particular, research into the science of human mobility has become increasingly critical thanks to its interdisciplinary impact on fields including urban planning, transportation engineering, public health, disaster management, and economic analysis. Researchers in the computational social science, complex systems, and behavioral science communities have used such granular mobility data to uncover universal laws and theories governing individual and collective human behavior. Moreover, computer science researchers have focused on developing computational and machine learning models capable of predicting complex behavior patterns in urban environments. Prominent papers include pattern-based and deep learning approaches to next-location prediction and physics-inspired approaches to flow prediction and generation.
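
As an illustration of what a physics-inspired flow model looks like, here is a classic gravity model sketched with made-up numbers (not anything from the paper): predicted flow between two locations grows with the product of their populations and decays with a power of the distance between them.

```python
import numpy as np

# Gravity model for origin-destination flow prediction: flow between
# locations i and j is proportional to the product of their populations
# divided by a power of the distance between them. Populations,
# coordinates, and parameters below are made up for illustration.
populations = np.array([1_000_000, 250_000, 80_000])
coords = np.array([[0.0, 0.0], [50.0, 10.0], [20.0, 60.0]])  # km

k, beta = 1e-6, 2.0  # scaling constant and distance-decay exponent
dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
np.fill_diagonal(dist, np.inf)  # exclude self-flows

flows = k * populations[:, None] * populations[None, :] / dist**beta
print(np.round(flows, 1))  # predicted trips for each origin-destination pair
```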

Regardless of the research problem of interest, human mobility datasets often come with substantial limitations. Existing publicly available datasets are often small, limited to specific transport modes, or geographically restricted, owing to the lack of open-source and large-scale human mobility datasets caused by privacy concerns…(More)”.

AI-Ready FAIR Data: Accelerating Science through Responsible AI and Data Stewardship


Article by Sean Hill: “Imagine a future where scientific discovery is unbound by the limitations of data accessibility and interoperability. In this future, researchers across all disciplines — from biology and chemistry to astronomy and social sciences — can seamlessly access, integrate, and analyze vast datasets with the assistance of advanced artificial intelligence (AI). This world is one where AI-ready data empowers scientists to unravel complex problems at unprecedented speeds, leading to breakthroughs in medicine, environmental conservation, technology, and more. The vision of a truly FAIR (Findable, Accessible, Interoperable, Reusable) and AI-ready data ecosystem, underpinned by Responsible AI (RAI) practices and the pivotal role of data stewards, promises to revolutionize the way science is conducted, fostering an era of rapid innovation and global collaboration…(More)”.
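
What “AI-ready FAIR data” looks like in practice is largely a metadata question. Below is a minimal, hypothetical record illustrating the four FAIR attributes; all identifiers, URLs, and field names are placeholders, not a real standard.

```python
import json

# A minimal, hypothetical metadata record illustrating FAIR attributes.
# All identifiers and URLs are placeholders, not real resources.
dataset_metadata = {
    # Findable: a persistent identifier plus rich descriptive metadata
    "identifier": "https://doi.org/10.xxxx/example-dataset",
    "title": "Example multi-site electrophysiology recordings",
    "keywords": ["neuroscience", "electrophysiology", "open data"],
    # Accessible: a standard protocol and clear access conditions
    "access_url": "https://repository.example.org/datasets/1234",
    "access_protocol": "HTTPS",
    # Interoperable: community formats and shared vocabularies
    "format": "application/x-hdf5",
    "vocabulary": "https://schema.org/Dataset",
    # Reusable: a licence and provenance that machines (and AI) can act on
    "license": "CC-BY-4.0",
    "provenance": "Collected 2024; processed with pipeline v2.1",
}
print(json.dumps(dataset_metadata, indent=2))
```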