New Data Browser on education, science, and culture


UNESCO: “The UIS is excited to introduce the new UIS Data Browser, which brings together all our data on education, science, and culture, making it a convenient resource for everyone, from policymakers to researchers.

With a refreshed interface, users can easily view and download customized data for their needs. The new browser also offers better tools for exploring metadata and documentation. Plus, the browser has great visualization features. You can filter indicators by country or region and create line or bar charts to see trends over time. It’s easy to share your findings on social media, too!

For those who like to dive deeper, a web-based UIS Data Application Programming Interface (API) allows for more technical data extraction for use in reports and applications. The UIS Data API provides access to all education, science, and culture data available on the UIS data browser through HTTP requests. It allows for the regular retrieval of data for custom analysis, visualizations, and applications…(More)”.

Augmenting the availability of historical GDP per capita estimates through machine learning


Paper by Philipp Koch, Viktor Stojkoski, and César A. Hidalgo: “Can we use data on the biographies of historical figures to estimate the GDP per capita of countries and regions? Here, we introduce a machine learning method to estimate the GDP per capita of dozens of countries and hundreds of regions in Europe and North America for the past seven centuries starting from data on the places of birth, death, and occupations of hundreds of thousands of historical figures. We build an elastic net regression model to perform feature selection and generate out-of-sample estimates that explain 90% of the variance in known historical income levels. We use this model to generate GDP per capita estimates for countries, regions, and time periods for which these data are not available and externally validate our estimates by comparing them with four proxies of economic output: urbanization rates in the past 500 y, body height in the 18th century, well-being in 1850, and church building activity in the 14th and 15th century. Additionally, we show our estimates reproduce the well-known reversal of fortune between southwestern and northwestern Europe between 1300 and 1800 and find this is largely driven by countries and regions engaged in Atlantic trade. These findings validate the use of fine-grained biographical data as a method to augment historical GDP per capita estimates. We publish our estimates with CI together with all collected source data in a comprehensive dataset…(More)”.
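To make the feature-selection step mentioned above concrete, here is a minimal sketch of elastic net regression fitted by coordinate descent. It is not the authors' code or data: the synthetic features, the penalty settings (`alpha`, `l1_ratio`), and the helper names are all assumptions chosen for illustration, and the sketch omits an intercept and the out-of-sample evaluation the paper reports.

```python
import random

random.seed(42)

def soft_threshold(value, threshold):
    """Shrink value toward zero; this is the L1 part of the penalty."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def elastic_net(X, y, alpha, l1_ratio, sweeps=50):
    """Coordinate descent for the objective
    (1/2n)*||y - Xw||^2 + alpha*l1_ratio*||w||_1
    + 0.5*alpha*(1 - l1_ratio)*||w||^2   (no intercept term).
    """
    n, p = len(X), len(X[0])
    columns = [[row[j] for row in X] for j in range(p)]
    weights = [0.0] * p
    residual = list(y)  # equals y - Xw while all weights are zero
    for _ in range(sweeps):
        for j, col in enumerate(columns):
            # Covariance of feature j with the residual that ignores feature j.
            rho = sum(c * (r + c * weights[j]) for c, r in zip(col, residual)) / n
            scale = sum(c * c for c in col) / n
            updated = soft_threshold(rho, alpha * l1_ratio) / (
                scale + alpha * (1 - l1_ratio)
            )
            delta = updated - weights[j]
            residual = [r - c * delta for c, r in zip(col, residual)]
            weights[j] = updated
    return weights

# Hypothetical data: five features, of which only the first two drive the target.
N_ROWS, N_FEATURES = 1000, 5
X = [[random.gauss(0, 1) for _ in range(N_FEATURES)] for _ in range(N_ROWS)]
y = [3.0 * row[0] - 2.0 * row[1] + random.gauss(0, 0.1) for row in X]

weights = elastic_net(X, y, alpha=0.4, l1_ratio=0.5)
selected = [j for j, w in enumerate(weights) if w != 0.0]
print("weights:", weights)
print("selected feature indices:", selected)
```

The L1 term sets the weights of uninformative features to exactly zero, which is what makes the model usable for feature selection, while the L2 term shrinks the surviving weights and keeps the fit stable when features are correlated.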

Revisiting the ‘Research Parasite’ Debate in the Age of AI


Article by C. Brandon Ogbunu: “A 2016 editorial published in the New England Journal of Medicine lamented the existence of “research parasites,” those who pick over the data of others rather than generating new data themselves. The article touched on the ethics and appropriateness of this practice. The most charitable interpretation of the argument centered around the hard work and effort that goes into the generation of new data, which costs millions of research dollars and takes countless person-hours. Whatever the merits of that argument, the editorial and its associated arguments were widely criticized.

Given recent advances in AI, revisiting the research parasite debate offers a new perspective on the ethics of sharing and data democracy. It is ironic that the critics of research parasites might have made a sound argument — but for the wrong setting, aimed at the wrong target, at the wrong time. Specifically, the large language models, or LLMs, that underlie generative AI tools such as OpenAI’s ChatGPT, have an ethical challenge in how they parasitize freely available data. These discussions bring up new conversations about data security that may undermine, or at least complicate, efforts at openness and data democratization.

The backlash to that 2016 editorial was swift and violent. Many arguments centered around the anti-science spirit of the message. For example, meta-analysis – which re-analyzes data from a selection of studies – is a critical practice that should be encouraged. Many groundbreaking discoveries about the natural world and human health have come from this practice, including new pictures of the molecular causes of depression and schizophrenia. Further, the central criticisms of research parasitism undermine the ethical goals of data sharing and ambitions for open science, where scientists and citizen-scientists can benefit from access to data. This differs from the status quo in 2016, when data published in many of the top journals of the world were locked behind a paywall, illegible, poorly labeled, or difficult to use. This remains largely true in 2024…(More)”.

The Art of Uncertainty


Book by David Spiegelhalter: “We live in a world where uncertainty is inevitable. How should we deal with what we don’t know? And what role do chance, luck and coincidence play in our lives?

David Spiegelhalter has spent his career dissecting data in order to understand risks and assess the chances of what might happen in the future. In The Art of Uncertainty, he gives readers a window onto how we can all do this better.

In engaging, crystal-clear prose, he takes us through the principles of probability, showing how it can help us think more analytically about everything from medical advice to pandemics and climate change forecasts, and explores how we can update our beliefs about the future in the face of constantly changing experience. Along the way, he explains why roughly 40% of football results come down to luck rather than talent, how the National Risk Register assesses near-term risks to the United Kingdom, and why we can be so confident that two properly shuffled packs of cards have never, ever been in the exact same order.

Drawing on a wide range of captivating real-world examples, this is an essential guide to navigating uncertainty while also having the humility to admit what we do not know…(More)”.
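The shuffled-cards claim in the blurb can be checked with a few lines of arithmetic. The sketch below counts the orderings of a 52-card pack and applies a birthday-style upper bound on the chance that any two shuffles ever matched. The figure for the total number of shuffles in history is a deliberately generous assumption of ours, not a number from the book.

```python
import math

# Number of distinct orderings of a 52-card pack.
orderings = math.factorial(52)

# Assumption for illustration only: a very generous guess at how many
# properly shuffled packs have ever been produced.
total_shuffles = 10**20

# Birthday-style bound: with k uniformly random shuffles out of N orderings,
# P(at least one repeated ordering) <= k^2 / (2N).
collision_bound = total_shuffles**2 / (2 * orderings)

print(f"orderings of 52 cards: about {orderings:.3e}")
print(f"upper bound on the chance of any repeat: {collision_bound:.1e}")
```

Even under that inflated assumption the bound is vanishingly small, which is the sense in which one can be confident that no two properly shuffled packs have ever matched.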

AI firms must play fair when they use academic data in training


Nature Editorial: “But others are worried about principles such as attribution, the currency by which science operates. Fair attribution is a condition of reuse under CC BY, a commonly used open-access copyright license. In jurisdictions such as the European Union and Japan, there are exemptions to copyright rules that cover factors such as attribution — for text and data mining in research using automated analysis of sources to find patterns, for example. Some scientists see LLM data-scraping for proprietary LLMs as going well beyond what these exemptions were intended to achieve.

In any case, attribution is impossible when a large commercial LLM uses millions of sources to generate a given output. But when developers create AI tools for use in science, a method known as retrieval-augmented generation could help. This technique doesn’t apportion credit to the data that trained the LLM, but does allow the model to cite papers that are relevant to its output, says Lucy Lu Wang, an AI researcher at the University of Washington in Seattle.

Giving researchers the ability to opt out of having their work used in LLM training could also ease their worries. Creators have this right under EU law, but it is tough to enforce in practice, says Yaniv Benhamou, who studies digital law and copyright at the University of Geneva. Firms are devising innovative ways to make it easier. Spawning, a start-up company in Minneapolis, Minnesota, has developed tools to allow creators to opt out of data scraping. Some developers are also getting on board: OpenAI’s Media Manager tool, for example, allows creators to specify how their works can be used by machine-learning algorithms…(More)”.

Supporting Scientific Citizens


Article by Lisa Margonelli: “What do nuclear fusion power plants, artificial intelligence, hydrogen infrastructure, and drinking water recycled from human waste have in common? Aside from being featured in this edition of Issues, they all require intense public engagement to choose among technological tradeoffs, safety profiles, and economic configurations. Reaching these understandings requires researchers, engineers, and decisionmakers who are adept at working with the public. It also requires citizens who want to engage with such questions and can articulate what they want from science and technology.

This issue offers a glimpse into what these future collaborations might look like. To train engineers with the “deep appreciation of the social, cultural, and ethical priorities and implications of the technological solutions engineers are tasked with designing and deploying,” University of Michigan nuclear engineer Aditi Verma and coauthors Katie Snyder and Shanna Daly asked their first-year engineering students to codesign nuclear power plants in collaboration with local community members. Although traditional nuclear engineering classes avoid “getting messy,” Verma and colleagues wanted students to engage honestly with the uncertainties of the profession. In the process of working with communities, the students’ vocabulary changed; they spoke of trust, respect, and “love” for community—even when considering deep geological waste repositories…(More)”.

Is peer review failing its peer review?


Article by First Principles: “Ivan Oransky doesn’t sugar-coat his answer when asked about the state of academic peer review: “Things are pretty bad.”

As a distinguished journalist in residence at New York University and co-founder of Retraction Watch – a site that chronicles the growing number of papers being retracted from academic journals – Oransky is better positioned than just about anyone to make such a blunt assessment. 

He elaborates further, citing a range of factors contributing to the current state of affairs. These include the publish-or-perish mentality, chatbot ghostwriting, predatory journals, plagiarism, an overload of papers, a shortage of reviewers, and weak incentives to attract and retain reviewers.

“Things are pretty bad and they have been bad for some time because the incentives are completely misaligned,” Oransky told FirstPrinciples in a call from his NYU office.

Things are so bad that a new world record was set in 2023: more than 10,000 research papers were retracted from academic journals. In a troubling development, 19 journals closed after being inundated by a barrage of fake research from so-called “paper mills” that churn out the scientific equivalent of clickbait, and one scientist holds the current record of 213 retractions to his name. 

“The numbers don’t lie: Scientific publishing has a problem, and it’s getting worse,” Oransky and Retraction Watch co-founder Adam Marcus wrote in a recent opinion piece for The Washington Post. “Vigilance against fraudulent or defective research has always been necessary, but in recent years the sheer amount of suspect material has threatened to overwhelm publishers.”…(More)”.

The problem of ‘model collapse’: how a lack of human data limits AI progress


Article by Michael Peel: “The use of computer-generated data to train artificial intelligence models risks causing them to produce nonsensical results, according to new research that highlights looming challenges to the emerging technology. 

Leading AI companies, including OpenAI and Microsoft, have tested the use of “synthetic” data — information created by AI systems to then also train large language models (LLMs) — as they reach the limits of human-made material that can improve the cutting-edge technology.

Research published in Nature on Wednesday suggests the use of such data could lead to the rapid degradation of AI models. One trial using synthetic input text about medieval architecture descended into a discussion of jackrabbits after fewer than 10 generations of output. 

The work underlines why AI developers have hurried to buy troves of human-generated data for training — and raises questions of what will happen once those finite sources are exhausted. 

“Synthetic data is amazing if we manage to make it work,” said Ilia Shumailov, lead author of the research. “But what we are saying is that our current synthetic data is probably erroneous in some ways. The most surprising thing is how quickly this stuff happens.”

The paper explores the tendency of AI models to collapse over time because of the inevitable accumulation and amplification of mistakes from successive generations of training.

The speed of the deterioration is related to the severity of shortcomings in the design of the model, the learning process and the quality of data used. 

The early stages of collapse typically involve a “loss of variance”, which means majority subpopulations in the data become progressively over-represented at the expense of minority groups. In late-stage collapse, all parts of the data may descend into gibberish…(More)”.
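The “loss of variance” described above can be reproduced in a toy setting. The sketch below is not the experiment from the Nature paper: it assumes the "model" is just a Gaussian fitted to ten samples, and that each generation is trained only on data sampled from the previous generation's fit. The sample size, generation count, and variable names are illustrative choices.

```python
import random
import statistics

random.seed(0)

SAMPLES_PER_GENERATION = 10   # small on purpose, so estimation error is visible
GENERATIONS = 200

mean, std = 0.0, 1.0          # generation 0: the "human" data distribution
variances = [std ** 2]

for _ in range(GENERATIONS):
    # Draw synthetic data from the current model...
    synthetic = [random.gauss(mean, std) for _ in range(SAMPLES_PER_GENERATION)]
    # ...then fit the next model to that synthetic data alone.
    mean = statistics.fmean(synthetic)
    std = statistics.pstdev(synthetic)
    variances.append(std ** 2)

for generation in (0, 10, 50, 100, 200):
    print(f"generation {generation:3d}: variance = {variances[generation]:.3e}")
```

Each refit loses a little of the spread in the tails, and because later generations never see the original data, those losses compound instead of averaging out. Real language models are far more complex, but the same feedback loop is what the article's description of early-stage collapse points to.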

Illuminating ‘the ugly side of science’: fresh incentives for reporting negative results


Article by Rachel Brazil: “Editor-in-chief Sarahanne Field describes herself and her team at the Journal of Trial & Error as wanting to highlight the “ugly side of science — the parts of the process that have gone wrong”.

She clarifies that the editorial board of the journal, which launched in 2020, isn’t interested in papers in which “you did a shitty study and you found nothing. We’re interested in stuff that was done methodologically soundly, but still yielded a result that was unexpected.” These types of result — which do not prove a hypothesis or could yield unexplained outcomes — often simply go unpublished, explains Field, who is also an open-science researcher at the University of Groningen in the Netherlands. Along with Stefan Gaillard, one of the journal’s founders, she hopes to change that.

Calls for researchers to publish failed studies are not new. The ‘file-drawer problem’ — the stacks of unpublished, negative results that most researchers accumulate — was first described in 1979 by psychologist Robert Rosenthal. He argued that this leads to publication bias in the scientific record: the gap of missing unsuccessful results leads to overemphasis on the positive results that do get published…(More)”.

The Risks of Empowering “Citizen Data Scientists”


Article by Reid Blackman and Tamara Sipes: “Until recently, the prevailing understanding of artificial intelligence (AI) and its subset machine learning (ML) was that expert data scientists and AI engineers were the only people that could push AI strategy and implementation forward. That was a reasonable view. After all, data science generally, and AI in particular, is a technical field requiring, among other things, expertise that requires many years of education and training to obtain.

Fast forward to today, however, and the conventional wisdom is rapidly changing. The advent of “auto-ML” — software that provides methods and processes for creating machine learning code — has led to calls to “democratize” data science and AI. The idea is that these tools enable organizations to invite and leverage non-data scientists — say, domain data experts, team members very familiar with the business processes, or heads of various business units — to propel their AI efforts.

In theory, making data science and AI more accessible to non-data scientists (including technologists who are not data scientists) can make a lot of business sense. Centralized and siloed data science units can fail to appreciate the vast array of data the organization has and the business problems that it can solve, particularly with multinational organizations with hundreds or thousands of business units distributed across several continents. Moreover, those in the weeds of business units know the data they have, the problems they’re trying to solve, and can, with training, see how that data can be leveraged to solve those problems. The opportunities are significant.

In short, with great business insight, augmented with auto-ML, can come great analytic responsibility. At the same time, we cannot forget that data science and AI are, in fact, very difficult, and there’s a very long journey from having data to solving a problem. In this article, we’ll lay out the pros and cons of integrating citizen data scientists into your AI strategy and suggest methods for optimizing success and minimizing risks…(More)”.