science

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

Curated on June 8, 2025 by Stefaan Verhulst

Paper by Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar: “Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces’ structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter-intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles.
We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models’ computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities…(More)”

The path for AI in poor nations does not need to be paved with billions

Curated on June 8, 2025 by Stefaan Verhulst

Editorial in Nature: “Coinciding with US President Donald Trump’s tour of Gulf states last week, Saudi Arabia announced that it is embarking on a large-scale artificial intelligence (AI) initiative. The proposed venture will have state backing and considerable involvement from US technology firms. It is the latest move in a global expansion of AI ambitions beyond the existing heartlands of the United States, China and Europe. However, as Nature India, Nature Africa and Nature Middle East report in a series of articles on AI in low- and middle-income countries (LMICs) published on 21 May (see go.nature.com/45jy3qq), the path to home-grown AI doesn’t need to be paved with billions, or even hundreds of millions, of dollars, or depend exclusively on partners in Western nations or China…, as a News Feature that appears in the series makes plain (see go.nature.com/3yrd3u2), many initiatives in LMICs aren’t focusing on scaling up, but on ‘scaling right’. They are “building models that work for local users, in their languages, and within their social and economic realities”.

More such local initiatives are needed. Some of the most popular AI applications, such as OpenAI’s ChatGPT and Google Gemini, are trained mainly on data in European languages. That would mean that the model is less effective for users who speak Hindi, Arabic, Swahili, Xhosa and countless other languages. Countries are boosting home-grown apps by funding start-up companies, establishing AI education programmes, building AI research and regulatory capacity and through public engagement.

Those LMICs that have started investing in AI began by establishing an AI strategy, including policies for AI research. However, as things stand, most of the 55 member states of the African Union and of the 22 members of the League of Arab States have not produced an AI strategy. That must change…(More)”.

Scientific Publishing: Enough is Enough

Curated on June 4, 2025 by Stefaan Verhulst

Blog by Seemay Chou: “In Abundance, Ezra Klein and Derek Thompson make the case that the biggest barriers to progress today are institutional. They’re not because of physical limitations or intellectual scarcity. They’re the product of legacy systems — systems that were built with one logic in mind, but now operate under another. And until we go back and address them at the root, we won’t get the future we say we want.

I’m a scientist. Over the past five years, I’ve experimented with science outside traditional institutes. From this vantage point, one truth has become inescapable. The journal publishing system — the core of how science is currently shared, evaluated, and rewarded — is fundamentally broken. And I believe it’s one of the legacy systems that prevents science from meeting its true potential for society.

It’s an unpopular moment to critique the scientific enterprise given all the volatility around its funding. But we do have a public trust problem. The best way to increase trust and protect science’s future is for scientists to have the hard conversations about what needs improvement. And to do this transparently. In all my discussions with scientists across every sector, exactly zero think the journal system works well. Yet we all feel trapped in a system that is, by definition, us.

I no longer believe that incremental fixes are enough. Science publishing must be built anew. I help oversee billions of dollars in funding across several science and technology organizations. We are expanding our requirement that all scientific work we fund will not go towards traditional journal publications. Instead, research we support should be released and reviewed more openly, comprehensively, and frequently than the status quo.

This policy is already in effect at Arcadia Science and Astera Institute, and we’re actively funding efforts to build journal alternatives through both Astera and The Navigation Fund. We hope others cross this line with us, and below I explain why every scientist and science funder should strongly consider it…(More)”.

WorkflowHub: a registry for computational workflows

Curated on May 24, 2025 (updated May 26, 2025) by Stefaan Verhulst

Paper by Ove Johan Ragnar Gustafsson et al: “The rising popularity of computational workflows is driven by the need for repetitive and scalable data processing, sharing of processing know-how, and transparent methods. As both combined records of analysis and descriptions of processing steps, workflows should be reproducible, reusable, adaptable, and available. Workflow sharing presents opportunities to reduce unnecessary reinvention, promote reuse, increase access to best practice analyses for non-experts, and increase productivity. In reality, workflows are scattered and difficult to find, in part due to the diversity of available workflow engines and ecosystems, and because workflow sharing is not yet part of research practice. WorkflowHub provides a unified registry for all computational workflows that links to community repositories, and supports both the workflow lifecycle and making workflows findable, accessible, interoperable, and reusable (FAIR). By interoperating with diverse platforms, services, and external registries, WorkflowHub adds value by supporting workflow sharing, explicitly assigning credit, enhancing FAIRness, and promoting workflows as scholarly artefacts. The registry has a global reach, with hundreds of research organisations involved, and more than 800 workflows registered…(More)”

Can We Trust Social Science Yet?

Curated on May 21, 2025 by Stefaan Verhulst

Essay by Ryan Briggs: “Everyone likes the idea of evidence-based policy, but it’s hard to realize it when our most reputable social science journals are still publishing poor quality research.

Ideally, policy and program design is a straightforward process: a decision-maker faces a problem, turns to peer-reviewed literature, and selects interventions shown to work. In reality, that’s rarely how things unfold. The popularity of “evidence-based medicine” and other “evidence-based” topics highlights our desire for empirical approaches — but would the world actually improve if those in power consistently took social science evidence seriously? It brings me no joy to tell you that, at present, I think the answer is usually “no.”

Given the current state of evidence production in the social sciences, I believe that many — perhaps most — attempts to use social scientific evidence to inform policy will not lead to better outcomes. This is not because of politics or the challenges of scaling small programs. The problem is more immediate. Much of social science research is of poor quality, and sorting the trustworthy work from bad work is difficult, costly, and time-consuming.

But it is necessary. If you were to randomly select an empirical paper published in the past decade — including any studies from the top journals in political science or economics — there is a high chance that its findings may be inaccurate. And not just off by a little: possibly two times as large, or even incorrectly signed. As an academic, this bothers me. I think it should bother you, too. So let me explain why this happens…(More)”.

Public AI White Paper – A Public Alternative to Private AI Dominance

Curated on May 21, 2025 by Stefaan Verhulst

White paper by the Bertelsmann Stiftung and Open Future: “Today, the most advanced AI systems are developed and controlled by a small number of private companies. These companies hold power not only over the models themselves but also over key resources such as computing infrastructure. This concentration of power poses not only economic risks but also significant democratic challenges.

The Public AI White Paper presents an alternative vision, outlining how open and public-interest approaches to AI can be developed and institutionalized. It advocates for a rebalancing of power within the AI ecosystem – with the goal of enabling societies to shape AI actively, rather than merely consume it…(More)”.

“R&D” Means Something Different on Capitol Hill

Curated on May 14, 2025 by Stefaan Verhulst

Article by Sheril Kirshenbaum: “My first morning as a scientist-turned-Senate-staffer began with a misunderstanding that would become a metaphor for my impending immersion into the complex world of policymaking. When my new colleagues mentioned “R&D,” I naively assumed they were discussing critical topics related to research and development. After 10 or so confused minutes, I realized they were referring to Republicans and Democrats—my first lesson in the distinctive language and unique dynamics of congressional work. The “R&D” at the center of their world was vastly different than that of mine.

In the 20 years since, I’ve moved between academic science positions and working on science policy in the Senate, under both Republican and Democratic majorities. My goal during these two decades has remained the same—to promote evidence-based policymaking that advances science and serves the public, regardless of the political landscape. But the transition from scientist to staffer has transformed my understanding of why so many efforts by scientists to influence policy falter. Despite generations of scholarly research to understand how information informs political decisions, scientists and other academics consistently overlook a crucial part of the process: the role of congressional staffers.

The staff hierarchy shapes how scientific information flows to elected officials. Chiefs of staff manage office operations and serve as the member’s closest advisors. Legislative directors oversee all policy matters, while legislative assistants (LAs) handle specific issue portfolios. One or two LAs may be designated as the office “science people,” although they often lack formal scientific training. Committee staffers provide deeper expertise and institutional knowledge on topics within their jurisdiction. In this ecosystem, few dedicated science positions exist, and science-related topics are distributed among staff already juggling multiple responsibilities…(More)”

Playing for science: Designing science games

Curated on May 2, 2025 by Stefaan Verhulst

Paper by Claudio M Radaelli: “How can science have more impact on policy decisions? The P-Cube Project has approached this question by creating five pedagogical computer games based on missions given to a policy entrepreneur (the player) advocating for science-informed policy decisions. The player explores simplified strategies for policy change rooted in a small number of variables, thus making it possible to learn without a prior background in political science or public administration. The games evolved from the intuition that, instead of making additional efforts to explain science to decision-makers, we should directly empower would-be scientists (our primary audience for the games), post-graduates in public policy and administration, and activists for science. The two design principles of the games revolve around learning about how policy decisions are made (a learning-about-content principle) and reflection. Indeed, the presence of science in the policy process raises ethical and normative decisions, especially when we consider controversial strategies like civil disobedience and alliances with industry. To be on the side of science does not mean to be outside society and politics. I show the motivation, principles, scripts and pilots of the science games, reflecting on how they can be used and for what reasons…(More)”

Entering the Vortex

Curated on May 1, 2025 by Stefaan Verhulst

Essay by Nils Gilman: “A strange and unsettling weather pattern is forming over the landscape of scholarly research. For decades, the climate of academic inquiry was shaped by a prevailing high-pressure system, a consensus grounded in the vision articulated by Vannevar Bush in “Science: The Endless Frontier” (1945). That era was characterized by robust federal investment, a faith in the university as the engine of basic research, and a compact that traded public funding for scientific autonomy and the promise of long-term societal benefit. It was a climate conducive to the slow, deliberate, and often unpredictable growth of knowledge, nurtured by a diverse ecosystem of human researchers — the vital “seed stock” of intellectual discovery.

But that high-pressure system is collapsing. A brutal, unyielding cold front of academic defunding has swept across the nation, a consequence of shifting political priorities, populist resentment, and a calculated assault on the university as an institution perceived as hostile to certain political agendas. This is not merely a belt-tightening exercise; it is, for all intents and purposes, the dismantling of Vannevar Bush’s Compact, the end of the era of “big government”-funded Wissenschaft. Funding streams for basic research are dwindling, grant applications face increasingly long odds, and the financial precarity of academic careers deters the brightest minds. The human capital necessary for sustained, fundamental inquiry is beginning to wither.

Simultaneously, a warm, moisture-laden airmass is rapidly advancing: the astonishing rise of AI-based research tools. Powered by vast datasets and sophisticated algorithms, these tools promise to revolutionize every stage of the research process – from literature review and data analysis to hypothesis generation and the drafting of scholarly texts. As a recent New Yorker piece on AI and the humanities suggests, these AI engines can already generate deep research and coherent texts on virtually any subject, seemingly within moments. They offer the prospect of unprecedented efficiency, speed, and scale in the production of scholarly output.

The collision of these two epochal weather systems — the brutal cold front of academic defunding and the warm, expansive airmass of AI-based research tools — is creating an atmospheric instability unlike anything the world of scholarship has ever witnessed. Along the front where these forces meet, a series of powerful and unpredictable tornados are beginning to touch down, reshaping the terrain of knowledge production in real-time…(More)”.

Real-time prices, real results: comparing crowdsourcing, AI, and traditional data collection

Curated on May 1, 2025 by Stefaan Verhulst

Article by Julius Adewopo, Bo Andree, Zacharey Carmichael, Steve Penson, Kamwoo Lee: “Timely, high-quality food price data is essential for shock responsive decision-making. However, in many low- and middle-income countries, such data is often delayed, limited in geographic coverage, or unavailable due to operational constraints. Traditional price monitoring, which relies on structured surveys conducted by trained enumerators, is often constrained by challenges related to cost, frequency, and reach.

To help overcome these limitations, the World Bank launched the Real-Time Prices (RTP) data platform. This effort provides monthly price data using a machine learning framework. The models combine survey results with predictions derived from observations in nearby markets and related commodities. This approach helps fill gaps in local price data across a basket of goods, enabling real-time monitoring of inflation dynamics even when survey data is incomplete or irregular.

In parallel, new approaches—such as citizen-submitted (crowdsourced) data—are being explored to complement conventional data collection methods. These crowdsourced data were recently published in a Nature Scientific Data paper. While the adoption of these innovations is accelerating, maintaining trust requires rigorous validation.

A newly published study in PLOS compares the two emerging methods with the traditional, enumerator-led gold standard, providing new evidence that both crowdsourced and AI-imputed prices can serve as credible, timely alternatives to traditional ground-truth data collection—especially in contexts where conventional methods face limitations…(More)”.