Paper by Seliem El-Sayed, Ilona Kickbusch & Barbara Prainsack: “Most data governance frameworks are designed to protect the individuals from whom data originates. However, the impacts of digital practices extend to a broader population and are embedded in significant power asymmetries within and across nations. Further, inequities in digital societies impact everyone, not just those directly involved. Addressing these challenges requires an approach that moves beyond individual data control and is grounded in the values of equity and a just distribution of benefits and risks from data use. Solidarity-based data governance (in short: data solidarity) suggests prioritising data uses over data type and proposes that data uses that generate public value should be actively facilitated, those that generate significant risks and harms should be prohibited or strictly regulated, and those that generate private benefits with little or no public value should be ‘taxed’ so that profits generated by corporate data users are reinvested in the public domain. In the context of global health data governance, the public value generated by data use is crucial. This contribution clarifies the meaning, importance, and potential of public value within data solidarity and outlines methods for its operationalisation through the PLUTO tool, specifically designed to assess the public value of data uses…(More)”.
Kickstarting Collaborative, AI-Ready Datasets in the Life Sciences with Government-funded Projects
Article by Erika DeBenedictis, Ben Andrew & Pete Kelly: “In the age of Artificial Intelligence (AI), large high-quality datasets are needed to move the field of life science forward. However, the research community lacks strategies to incentivize collaboration on high-quality data acquisition and sharing. The government should fund collaborative roadmapping, certification, collection, and sharing of large, high-quality datasets in life science. In such a system, nonprofit research organizations engage scientific communities to identify key types of data that would be valuable for building predictive models, and define quality control (QC) and open science standards for collection of that data. Projects are designed to develop automated methods for data collection, certify data providers, and facilitate data collection in consultation with researchers throughout various scientific communities. Hosting of the resulting open data is subsidized and protected by security measures. This system would provide crucial incentives for the life science community to identify and amass large, high-quality open datasets that will immensely benefit researchers…(More)”.
The AI tool that can interpret any spreadsheet instantly
Article by Duncan C. McElfresh: “Say you run a hospital and you want to estimate which patients have the highest risk of deterioration so that your staff can prioritize their care. You create a spreadsheet in which there is a row for each patient, and columns for relevant attributes, such as age or blood-oxygen level. The final column records whether the person deteriorated during their stay. You can then fit a mathematical model to these data to estimate an incoming patient’s deterioration risk. This is a classic example of tabular machine learning, a technique that uses tables of data to make inferences. This usually involves developing — and training — a bespoke model for each task. Writing in Nature, Hollmann et al. report a model that can perform tabular machine learning on any data set without being trained specifically to do so.
Tabular machine learning shares a rich history with statistics and data science. Its methods are foundational to modern artificial intelligence (AI) systems, including large language models (LLMs), and its influence cannot be overstated. Indeed, many online experiences are shaped by tabular machine-learning models, which recommend products, generate advertisements and moderate social-media content. Essential industries such as healthcare and finance are also steadily, if cautiously, moving towards increasing their use of AI.
Despite the field’s maturity, Hollmann and colleagues’ advance could be revolutionary. The authors’ contribution is known as a foundation model, which is a general-purpose model that can be used in a range of settings. You might already have encountered foundation models, perhaps unknowingly, through AI tools, such as ChatGPT and Stable Diffusion. These models enable a single tool to offer varied capabilities, including text translation and image generation. So what does a foundation model for tabular machine learning look like?
Let’s return to the hospital example. With spreadsheet in hand, you choose a machine-learning model (such as a neural network) and train the model with your data, using an algorithm that adjusts the model’s parameters to optimize its predictive performance (Fig. 1a). Typically, you would train several such models before selecting one to use — a labour-intensive process that requires considerable time and expertise. And of course, this process must be repeated for each unique task.
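The workflow just described — one spreadsheet, one bespoke model, trained from scratch for one task — can be sketched in a few lines. Everything below is invented for illustration: the patient numbers, the feature-scaling ranges, and the learning rate are hypothetical, and a real clinical pipeline would use an established library with proper validation rather than this hand-rolled logistic regression.

```python
import math

# Toy "spreadsheet": one row per patient — (age, blood-oxygen %, deteriorated?).
# All values are invented for illustration.
rows = [
    (34, 98, 0), (71, 91, 1), (52, 95, 0), (80, 88, 1),
    (45, 97, 0), (66, 92, 1), (29, 99, 0), (75, 90, 1),
]

def scale(x, lo, hi):
    """Crude min-max feature scaling to keep gradients well-behaved."""
    return (x - lo) / (hi - lo)

X = [(scale(age, 20, 90), scale(spo2, 85, 100)) for age, spo2, _ in rows]
y = [outcome for _, _, outcome in rows]

# Train a tiny logistic-regression model with plain stochastic gradient descent.
w, b = [0.0, 0.0], 0.0
for _ in range(2000):
    for (x1, x2), target in zip(X, y):
        p = 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))  # sigmoid
        g = p - target                                        # log-loss gradient
        w[0] -= 0.1 * g * x1
        w[1] -= 0.1 * g * x2
        b    -= 0.1 * g

def risk(age, spo2):
    """Estimated deterioration risk for an incoming patient."""
    x1, x2 = scale(age, 20, 90), scale(spo2, 85, 100)
    return 1 / (1 + math.exp(-(w[0] * x1 + w[1] * x2 + b)))

print(f"risk(78, 89) = {risk(78, 89):.2f}")
print(f"risk(30, 99) = {risk(30, 99):.2f}")
```

The point of the sketch is the cost structure: the training loop and feature choices are specific to this one table, so every new task means repeating the whole process — exactly the burden a tabular foundation model aims to remove.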

Comparative perspectives on the regulation of large language models
Editorial to Special Issue by Cristina Poncibò and Martin Ebers: “Large language models (LLMs) represent one of the most significant technological advancements in recent decades, offering transformative capabilities in natural language processing and content generation. Their development has far-reaching implications across technological, economic and societal domains, simultaneously creating opportunities for innovation and posing profound challenges for governance and regulation. As LLMs become integral to various sectors, from education to healthcare to entertainment, regulators are scrambling to establish frameworks that ensure their safe and ethical use.
Our issue primarily examines the private ordering, regulatory responses and normative frameworks for LLMs from a comparative law perspective, with a particular focus on the European Union (EU), the United States (US) and China. An introductory part preliminarily explores the technical principles that underpin LLMs as well as their epistemological foundations. It also addresses key sector-specific legal challenges posed by LLMs, including their implications for criminal law, data protection and copyright law…(More)”.
Government reform starts with data, evidence
Article by Kshemendra Paul: “It’s time to strengthen the use of data, evidence and transparency to stop driving with mud on the windshield and to steer the government toward improving management of its programs and operations.
Existing Government Accountability Office and agency inspectors general reports identify thousands of specific evidence-based recommendations to improve efficiency, economy and effectiveness, and reduce fraud, waste and abuse. Many of these recommendations aim at program design and requirements, highlighting specific instances of overlap, redundancy and duplication. Others describe inadequate internal controls to balance program integrity with the experience of the customer, contractor or grantee. While progress is being reported in part due to stronger partnerships with IGs, much remains to be done. Indeed, GAO’s 2023 High Risk List, which it has produced going back to 1990, shows surprisingly slow progress of efforts to reduce risk to government programs and operations.
Here are a few examples:
- GAO estimates recent annual fraud at between $233 billion and $521 billion, or about 3% to 7% of federal spending. On the other hand, identified fraud with high-risk Recovery Act spending was held under 1% using data, transparency and partnerships with Offices of Inspectors General.
- GAO and IGs have collectively identified hundreds of billions in potential cost savings or improvements not yet addressed by federal agencies.
- GAO has recently described shortcomings with the government’s efforts to build evidence. While federal policymakers need good information to inform their decisions, the Commission on Evidence-Based Policymaking previously said, “too little evidence is produced to meet this need.”
One of the main reasons for agency sluggishness is the lack of agency and governmentwide use of synchronized, authoritative and shared data to support how the government manages itself.
For example, the Energy Department IG found that, “[t]he department often lacks the data necessary to make critical decisions, evaluate and effectively manage risks, or gain visibility into program results.” It is past time for the government to commit itself to move away from its widespread use of data calls, the error-prone, costly and manual aggregation of data used to support policy analysis and decision-making. Efforts to embrace data-informed approaches to manage government programs and operations are stymied by lack of basic agency and governmentwide data hygiene. While bright pockets exist, management gaps, as DOE OIG stated, “create blind spots in the universe of data that, if captured, could be used to more efficiently identify, track and respond to risks…”
The proposed approach starts with current agency operating models, then drives into management process integration to tackle root causes of dysfunction from the bottom up. It recognizes that inefficiency, fraud and other challenges are diffused, deeply embedded and have non-obvious interrelationships within the federal complex…(More)”
Survey of attitudes in a Danish public towards reuse of health data
Paper by Lea Skovgaard et al: “Everyday clinical care generates vast amounts of digital data. A broad range of actors are interested in reusing these data for various purposes. Such reuse of health data could support medical research, healthcare planning, technological innovation, and lead to increased financial revenue. Yet, reuse also raises questions about what data subjects think about the use of health data for these different purposes. Based on a survey of 1071 respondents conducted in 2021 in Denmark, this article explores attitudes to health data reuse. Denmark is renowned for its advanced integration of data infrastructures, facilitating data reuse. This is therefore a relevant setting from which to explore public attitudes to reuse, both as authorities around the globe are currently working to facilitate data reuse opportunities, and in the light of the recent agreement on the establishment in 2024 of the European Health Data Space (EHDS) within the European Union (EU). Our study suggests that there are certain forms of health data reuse—namely transnational data sharing, commercial involvement, and use of data as national economic assets—which risk undermining public support for health data reuse. However, these three controversial purposes are among those that the EHDS is intended to facilitate. Failure to address these public concerns could well challenge the long-term legitimacy and sustainability of the data infrastructures currently under construction…(More)”

The Circle of Sharing: How Open Datasets Power AI Innovation
A Sankey diagram developed by AI World and Hugging Face: “…illustrating the flow from top open-source datasets through AI organizations to their derivative models, showcasing the collaborative nature of AI development…(More)”.

Distorted insights from human mobility data
Paper by Riccardo Gallotti, Davide Maniscalco, Marc Barthelemy & Manlio De Domenico: “The description of human mobility is at the core of many fundamental applications ranging from urbanism and transportation to epidemics containment. Data about human movements, once scarce, are now widely available thanks to new sources such as phone call detail records, GPS devices, or smartphone apps. Nevertheless, it is still common to rely on a single dataset by implicitly assuming that the statistical properties observed are robust regardless of data gathering and processing techniques. Here, we test this assumption on a broad scale by comparing human mobility datasets obtained from 7 different data sources, tracing more than 500 million individuals in 145 countries. We report wide quantifiable differences in the resulting mobility networks and in the displacement distribution. These variations impact processes taking place on these networks, such as epidemic spreading. Our results point to the need to disclose data-processing details and, more generally, to follow good practices to ensure robust and reproducible results…(More)”
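The paper’s central caution — that the same underlying movement can yield different displacement distributions depending on how the data were gathered and processed — can be illustrated with a toy sketch. The trace and both “pipelines” below are invented: the study’s actual sources (call detail records, GPS devices, apps) differ in far more ways than sampling rate, but even this single processing choice visibly shifts the statistics.

```python
import math
import random

random.seed(0)

# A fake mobility trace: (x, y) positions sampled at regular intervals,
# with heavy-tailed step lengths as is typical of human movement.
trace = [(0.0, 0.0)]
for _ in range(500):
    x, y = trace[-1]
    step = random.paretovariate(1.5) * 0.1       # heavy-tailed move length
    angle = random.uniform(0, 2 * math.pi)
    trace.append((x + step * math.cos(angle), y + step * math.sin(angle)))

def displacements(points):
    """Consecutive displacement lengths along a trace."""
    return [math.dist(a, b) for a, b in zip(points, points[1:])]

# Pipeline A: displacements at the trace's native sampling rate.
d_raw = displacements(trace)

# Pipeline B: the SAME trajectory, but downsampled 6x — roughly how a
# coarser source (e.g. CDR-like records) might effectively observe it.
d_coarse = displacements(trace[::6])

median = lambda values: sorted(values)[len(values) // 2]
print(f"median displacement, native sampling: {median(d_raw):.3f}")
print(f"median displacement, 6x downsampled:  {median(d_coarse):.3f}")
```

Same person, same movement, two “datasets” with different displacement distributions — which is why the paper argues processing details must be disclosed before results can be compared or reproduced.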
SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning
Paper by Alireza Ghafarollahi and Markus J. Buehler: “A key challenge in artificial intelligence (AI) is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. This work presents SciAgents, an approach that leverages three core concepts: (1) large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses human research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the system yields material discoveries, critiques and improves existing hypotheses, retrieves up-to-date data about existing research, and highlights strengths and limitations. This is achieved by harnessing a “swarm of intelligence” similar to biological systems, providing new avenues for discovery. The authors show how this model accelerates the development of advanced materials by unlocking Nature’s design principles, resulting in a new biocomposite with enhanced mechanical properties and improved sustainability through energy-efficient production…(More)”.
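One ingredient named in the abstract — an ontological knowledge graph that interconnects scientific concepts — can be sketched minimally as path-finding between two concepts. The graph below is entirely invented for illustration; in SciAgents proper, LLM agents would expand, critique, and refine such a path into a full research hypothesis rather than merely printing it.

```python
from collections import deque

# A toy ontological concept graph (edges point from concept to related
# concept). All nodes and links are invented for illustration.
graph = {
    "spider silk": ["beta-sheet crystals", "biopolymer"],
    "beta-sheet crystals": ["mechanical strength"],
    "biopolymer": ["biodegradability"],
    "mechanical strength": ["biocomposite design"],
    "biodegradability": ["energy-efficient production"],
}

def shortest_path(src, dst):
    """Breadth-first search for the shortest concept chain from src to dst."""
    queue, seen = deque([[src]]), {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

path = shortest_path("spider silk", "energy-efficient production")
print(" -> ".join(path))
```

A chain like this is only a candidate link; the multi-agent layer described in the paper is what turns such raw graph connections into testable, mechanism-level hypotheses.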
Protecting civilians in a data-driven and digitalized battlespace: Towards a minimum basic technology infrastructure
Paper by Ann Fitz-Gerald and Jenn Hennebry: “This article examines the realities of modern day warfare, including a rising trend in hybrid threats and irregular warfare which employ emerging technologies supported by digital and data-driven processes. The way in which these technologies become applied generates a widened battlefield and leads to a greater number of civilians being caught up in conflict. Humanitarian groups mandated to protect civilians have adapted their approaches to the use of new emerging technologies. However, the lack of international consensus on the use of data, the public and private nature of the actors involved in conflict, the transnational aspects of the widened battlefield, and the heightened security risks in the conflict space pose enormous challenges for the protection of civilians agenda. Given the dual-use nature of emerging technologies, the challenges associated with their regulation, and the need for those affected by conflict to demonstrate digital media literacy and resilience, this paper proposes the development of guidance for a “minimum basic technology infrastructure” which is supported by technology, regulation, and public awareness and education…(More)”.