Stefaan Verhulst
Article by Jeffrey Mervis: “…U.S. Secretary of Commerce Howard Lutnick has disbanded five outside panels that provide scientific and community advice to the U.S. Census Bureau and other federal statistical agencies just as preparations are ramping up for the country’s next decennial census, in 2030.
The dozens of demographers, statisticians, and public members on the five panels received nearly identical letters this week telling them that “the Secretary of Commerce has determined that the purposes for which the [committee] was established have been fulfilled, and the committee has been terminated effective February 28, 2025. Thank you for your service.”
Statistician Robert Santos, who last month resigned as Census Bureau director 3 years into his 5-year term, says he’s “terribly disappointed but not surprised” by the move, noting how a recent directive by President Donald Trump on gender identity has disrupted data collection for a host of federal surveys…(More)”.
Article by Brooke Tanner and Cameron F. Kerry: “Indigenous languages play a critical role in preserving cultural identity and transmitting unique worldviews, traditions, and knowledge, but at least 40% of the world’s 6,700 languages are currently endangered. The United Nations declared 2022-2032 as the International Decade of Indigenous Languages to draw attention to this threat, in hopes of supporting the revitalization of these languages and preservation of access to linguistic resources.
Building on the advantages of small language models (SLMs), several initiatives have successfully adapted these models specifically for Indigenous languages. Such Indigenous language models (ILMs) represent a subset of SLMs that are designed, trained, and fine-tuned with input from the communities they serve.
Case studies and applications
- Meta released No Language Left Behind (NLLB-200), a 54-billion-parameter open-source machine translation model that supports 200 languages, as part of Meta’s universal speech translator project. The model includes support for languages with limited translation resources. While its breadth of language coverage is novel, NLLB-200 can struggle to capture the intricacies of local context for low-resource languages, and it often relies on machine-translated sentence pairs gathered from across the internet due to the scarcity of digitized monolingual data (a brief usage sketch follows this list).
- Lelapa AI’s InkubaLM-0.4B is an SLM targeting low-resource African languages. Trained on 1.9 billion tokens across languages including isiZulu, Yoruba, Swahili, and isiXhosa, InkubaLM-0.4B (with 400 million parameters) builds on Meta’s LLaMA 2 architecture, making it far smaller than the original 7-billion-parameter LLaMA 2 pretrained model.
- IBM Research Brazil and the University of São Paulo have collaborated on projects aimed at preserving Brazilian Indigenous languages such as Guarani Mbya and Nheengatu. These initiatives emphasize co-creation with Indigenous communities and address concerns about cultural exposure and language ownership. Initial efforts included electronic dictionaries, word prediction, and basic translation tools. Notably, when a prototype writing assistant for Guarani Mbya raised concerns about exposing their language and culture online, project leaders paused further development pending community consensus.
- Researchers have fine-tuned pre-trained models for Nheengatu using linguistic educational sources and translations of the Bible, with plans to incorporate community-guided spellcheck tools. Because the Bible translations, produced primarily by colonial-era priests, often sounded archaic and could reflect cultural abuse and violence, they were classified as potentially “toxic” data that would not be used in any deployed system without explicit Indigenous community agreement…(More)”.
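To make the translation workflow behind models like NLLB-200 concrete, here is a minimal, illustrative sketch using the Hugging Face transformers library. It assumes the publicly released distilled 600M-parameter NLLB-200 checkpoint and FLORES-200 language codes (e.g., eng_Latn for English, zul_Latn for isiZulu); the input sentence and choice of target language are arbitrary examples, not drawn from any of the projects above.

```python
# Minimal sketch: English -> isiZulu translation with a distilled NLLB-200 checkpoint.
# Assumes `transformers`, `torch`, and `sentencepiece` are installed.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "facebook/nllb-200-distilled-600M"  # smaller public variant of NLLB-200
tokenizer = AutoTokenizer.from_pretrained(checkpoint, src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

text = "Community-led data governance helps protect traditional knowledge."
inputs = tokenizer(text, return_tensors="pt")

# Force the decoder to begin with the target-language token (FLORES-200 code for isiZulu).
output_ids = model.generate(
    **inputs,
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("zul_Latn"),
    max_new_tokens=60,
)
print(tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0])
```

As the article notes, output quality for genuinely low-resource languages still depends heavily on the quantity and provenance of the underlying sentence pairs, which is why the community-governed data practices described above matter.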
Article by Anna Massoglia: “A battle is being waged in the quiet corners of government websites and data repositories. Essential public records are disappearing and, with them, Americans’ ability to hold those in power accountable.
Take the Department of Government Efficiency, Elon Musk’s federal cost-cutting initiative. Touted as “maximally transparent,” DOGE is supposed to make government spending more efficient. But when journalists and researchers exposed major errors — from double-counting contracts to conflating caps with actual spending — DOGE didn’t fix the mistakes. Instead, it made them harder to detect.
Many Americans hoped DOGE’s work would be a step toward cutting costs and restoring trust in government. But trust must be earned. If our leaders truly want to restore faith in our institutions, they must ensure that facts remain available to everyone, not just when convenient.
Since Jan. 20, public records across the federal government have been erased. Economic indicators that guide investments, scientific datasets that drive medical breakthroughs, federal health guidelines and historical archives that inform policy decisions have all been put on the chopping block. Some missing datasets have been restored but are incomplete or have unexplained changes, rendering them unreliable.
Both Republican and Democratic administrations have played a role in limiting public access to government records. But the scale and speed of the Trump administration’s data manipulation — combined with buyouts, resignations and other restructuring across federal agencies — signal a new phase in the war on public information. This is not just about deleting files; it’s about controlling what the public sees, shaping the narrative and limiting accountability.
The Trump administration is accelerating this trend with revisions to official records. Unelected advisors are overseeing a sweeping reorganization of federal data, granting entities like DOGE unprecedented access to taxpayer records with little oversight. This is not just a bureaucratic reshuffle — it is a fundamental reshaping of the public record.
The consequences of data manipulation extend far beyond politics. When those in power control the flow of information, they can dictate collective truth. Governments that manipulate information are not just rewriting statistics — they are rewriting history.
From authoritarian regimes that have erased dissent to leaders who have fabricated economic numbers to maintain their grip on power, the dangers of suppressing and distorting data are well-documented.
Misleading or inconsistent data can be just as dangerous as opacity. When hard facts are replaced with political spin, conspiracy theories take root and misinformation fills the void.
The fact that data suppression and manipulation have occurred before does not lessen the danger; it underscores the urgency of taking proactive measures to safeguard transparency. A missing statistic today can become a missing historical fact tomorrow. Over time, that can reshape our reality…(More)”.
Article by Stuart Fulton: “In this research project, we examine how digital platforms – specifically PescaData – can be leveraged to connect small-scale fishing cooperatives with impact investors and donors, creating new pathways for sustainable blue economy financing, while simultaneously ensuring fair data practices that respect data sovereignty and traditional ecological knowledge.
PescaData emerged as a pioneering digital platform that enables fishing communities to collect more accurate data to ensure sustainable fisheries, and it has since evolved to provide software as a service to fishing cooperatives and to allow fishers to document their solutions to environmental and economic challenges. Since 2022, small-scale fishers have used it to document nearly 300 initiatives that contribute to multiple Sustainable Development Goals.
Respecting Data Sovereignty in the Digital Age
One critical aspect of our research is acknowledging the unique challenges of implementing digital tools in traditional cooperative settings. Unlike conventional tech implementations that often extract value from communities, PescaData’s approach centers on data sovereignty – the principle that fishing communities should maintain ownership and control over their data. As the PescaData case study demonstrates, a humanity-centric rather than merely user-centric approach is essential. This means designing with compassion and establishing clear governance around data from the very beginning. The data generated by fishing cooperatives represents not just information, but traditional knowledge accumulated over generations of resource management.
The fishers themselves have articulated clear principles for data governance in a cooperative model:
- Ownership: Fishers, as data producers, decide who has access and under what conditions.
- Transparency: Clear agreements on data use.
- Knowledge assessment: Highlighting fishers’ contributions and placing them in decision-making positions.
- Co-design: Ensuring the platform meets their specific needs.
- Security: Protecting collected data…(More)”.
Essay by Henry Farrell, Alison Gopnik, Cosma Shalizi, and James Evans: “Debates about artificial intelligence (AI) tend to revolve around whether large models are intelligent, autonomous agents. Some AI researchers and commentators speculate that we are on the cusp of creating agents with artificial general intelligence (AGI), a prospect anticipated with both elation and anxiety. There have also been extensive conversations about cultural and social consequences of large models, orbiting around two foci: immediate effects of these systems as they are currently used, and hypothetical futures when these systems turn into AGI agents, perhaps even superintelligent AGI agents.
But this discourse about large models as intelligent agents is fundamentally misconceived. Combining ideas from social and behavioral sciences with computer science can help us understand AI systems more accurately. Large models should not be viewed primarily as intelligent agents, but as a new kind of cultural and social technology, allowing humans to take advantage of information other humans have accumulated.
The new technology of large models combines important features of earlier technologies. Like pictures, writing, print, video, Internet search, and other such technologies, large models allow people to access information that other people have created. Large models – currently language, vision, and multi-modal models – depend on the fact that the Internet has made the products of these earlier technologies readily available in machine-readable form. But like economic markets, state bureaucracies, and other social technologies, these systems not only make information widely available; they allow it to be reorganized, transformed, and restructured in distinctive ways. Adopting Herbert Simon’s terminology, large models are a new variant of the “artificial systems of human society” that process information to enable large-scale coordination…(More)”
Paper by Stefaan Verhulst and Hannah Chafetz: “Today’s global crises–from climate change to inequality–have demonstrated the need for a broader conceptual transformation in how to approach societal issues. Focusing on the questions can transform our understanding of today’s problems and unlock new discoveries and innovations that make a meaningful difference. Yet, how decision-makers go about asking questions remains an underexplored topic.
Much of our recent work has focused on advancing a new science of questions that uses participatory approaches to define and prioritize the questions that matter most. As part of this work, we convened an Interdisciplinary Committee on Establishing and Democratizing the Science of Questions to discuss why questions matter for society and the actions needed to build a movement around this new science.
In this article, we provide the main findings from these gatherings. First, we outline several roles that questions can play in shaping policy, research, and innovation. Supported by real-world examples, we discuss how questions are a critical device for setting agendas, increasing public participation, improving coordination, and more. We then present five key challenges, raised by the Committee, in developing a systematic approach to questions, along with potential solutions to address those challenges. Existing challenges include weak recognition of questions, a lack of skills, and a lack of consensus on what makes a good question.
In the latter part of this piece, we propose the concept of The QLab–a global center dedicated to the research and practice of asking questions. Co-developed with the Committee, the QLab would include five core functions: Thought Leadership, Architecting the Discovery of Questions, Field Building, Institutionalization and Practice, and Research on Questioning. By focusing on these core functions, The QLab can make significant progress towards establishing a field dedicated to the art and science of asking questions…(More)”.
Paper by Joshua S. Gans: “This paper examines how the introduction of artificial intelligence (AI), particularly generative and large language models capable of interpolating precisely between known data points, reshapes scientists’ incentives for pursuing novel versus incremental research. Extending the theoretical framework of Carnehl and Schneider (2025), we analyse how decision-makers leverage AI to improve precision within well-defined knowledge domains. We identify conditions under which the availability of AI tools encourages scientists to choose more socially valuable, highly novel research projects, contrasting sharply with traditional patterns of incremental knowledge growth. Our model demonstrates a critical complementarity: scientists strategically align their research novelty choices to maximise the domain where AI can reliably inform decision-making. This dynamic fundamentally transforms the evolution of scientific knowledge, leading either to systematic “stepping stone” expansions or endogenous research cycles of strategic knowledge deepening. We discuss the broader implications for science policy, highlighting how sufficiently capable AI tools could mitigate traditional inefficiencies in scientific innovation, aligning private research incentives closely with the social optimum…(More)”.
Blog and policy brief by Jeni Tennison: “The most obvious approach to get companies to share value back to the public sector in return for access to data is to charge them. However, there are a number of challenges with a “pay to access” approach: it’s hard to set the right price; it creates access barriers, particularly for cash-poor start-ups; and it creates a public perception that the government is willing to sell people’s data and might be tempted to loosen privacy-protecting governance controls in exchange for cash.
Are there other options? The policy brief explores a range of other approaches and assesses these against five goals that a value-sharing framework should ideally meet, to:
- Encourage use of public data, including by being easy for organisations to understand and administer.
- Provide a return on investment for the public sector, offsetting at least some of the costs of supporting the National Data Library (NDL) infrastructure and minimising administrative costs.
- Promote equitable innovation and economic growth in the UK, which might mean particularly encouraging smaller, home-grown businesses.
- Create social value, particularly towards this Government’s other missions, such as achieving Net Zero or unlocking opportunity for all.
- Build public trust by being easily explainable, avoiding misaligned incentives that encourage the breaking of governance guardrails, and feeling like a fair exchange.
In brief, alternatives to a pay-to-access model that still provide direct financial returns include:
- Discounts: the public sector could secure discounts on products and services created using public data. However, this could be difficult to administer and enforce.
- Royalties: taking a percentage of charges for products and services created using public data might be similarly hard to administer and enforce, but applies to more companies.
- Equity: taking equity in startups can provide long-term returns and align with public investment goals.
- Levies: targeted taxes on businesses that use public data can provide predictable revenue and encourage data use.
- General taxation: general taxation can fund data infrastructure, but it may lack the targeted approach and public visibility of other methods.
It’s also useful to consider non-financial conditions that could be put on organisations accessing public data…(More)”.
About: “DataLumos is an ICPSR archive for valuable government data resources. ICPSR has a long-standing commitment to safekeeping and disseminating US government and other social science data. DataLumos accepts deposits of public data resources from the community and recommendations of public data resources that ICPSR itself might add to DataLumos. Please consider making a monetary donation to sustain DataLumos…(More)”.
Report by the National Academies of Sciences, Engineering, and Medicine: “Artificial intelligence (AI) applications in the life sciences have the potential to enable advances in biological discovery and design at a faster pace and efficiency than is possible with classical experimental approaches alone. At the same time, AI-enabled biological tools developed for beneficial applications could potentially be misused for harmful purposes. Although the creation of biological weapons is not a new concept or risk, the potential for AI-enabled biological tools to affect this risk has raised concerns during the past decade.
This report, as requested by the Department of Defense, assesses how AI-enabled biological tools could uniquely impact biosecurity risk, and how advancements in such tools could also be used to mitigate these risks. The Age of AI in the Life Sciences reviews the capabilities of AI-enabled biological tools and can be used in conjunction with the 2018 National Academies report, Biodefense in the Age of Synthetic Biology, which sets out a framework for identifying the different risk factors associated with synthetic biology capabilities…(More)”