Facing & mitigating common challenges when working with real-world data: The Data Learning Paradigm


Paper by Jake Lever et al: “The rapid growth of data-driven applications is ubiquitous across virtually all scientific domains, and has led to an increasing demand for effective methods to handle data deficiencies and mitigate the effects of imperfect data. This paper presents a guide for researchers encountering real-world data-driven applications, and the respective challenges associated with this. This article proposes the concept of the Data Learning Paradigm, combining the principles of machine learning, data science and data assimilation to tackle real-world challenges in data-driven applications. Models are a product of the data upon which they are trained, and no data collected from real world scenarios is perfect due to natural limitations of sensing and collection. Thus, computational modelling of real world systems is intrinsically limited by the various deficiencies encountered in real data. The Data Learning Paradigm aims to leverage the strengths of data improvement to enhance the accuracy, reliability, and interpretability of data-driven models. We outline a range of methods which are currently being implemented in the field of Data Learning involving machine learning and data science methods, and discuss how these mitigate the various problems associated with data-driven models, illustrating improved results in a multitude of real world applications. We highlight examples where these methods have led to significant advancements in fields such as environmental monitoring, planetary exploration, healthcare analytics, linguistic analysis, social networks, and smart manufacturing. We offer a guide to how these methods may be implemented to deal with general types of limitations in data, alongside their current and potential applications…(More)”.

Kickstarting Collaborative, AI-Ready Datasets in the Life Sciences with Government-funded Projects


Article by Erika DeBenedictis, Ben Andrew & Pete Kelly: “In the age of Artificial Intelligence (AI), large high-quality datasets are needed to move the field of life science forward. However, the research community lacks strategies to incentivize collaboration on high-quality data acquisition and sharing. The government should fund collaborative roadmapping, certification, collection, and sharing of large, high-quality datasets in life science. In such a system, nonprofit research organizations engage scientific communities to identify key types of data that would be valuable for building predictive models, and define quality control (QC) and open science standards for collection of that data. Projects are designed to develop automated methods for data collection, certify data providers, and facilitate data collection in consultation with researchers throughout various scientific communities. Hosting of the resulting open data is subsidized as well as protected by security measures. This system would provide crucial incentives for the life science community to identify and amass large, high-quality open datasets that will immensely benefit researchers…(More)”.

The AI tool that can interpret any spreadsheet instantly


Article by Duncan C. McElfresh: “Say you run a hospital and you want to estimate which patients have the highest risk of deterioration so that your staff can prioritize their care1. You create a spreadsheet in which there is a row for each patient, and columns for relevant attributes, such as age or blood-oxygen level. The final column records whether the person deteriorated during their stay. You can then fit a mathematical model to these data to estimate an incoming patient’s deterioration risk. This is a classic example of tabular machine learning, a technique that uses tables of data to make inferences. This usually involves developing — and training — a bespoke model for each task. Writing in Nature, Hollmann et al.report a model that can perform tabular machine learning on any data set without being trained specifically to do so.

Tabular machine learning shares a rich history with statistics and data science. Its methods are foundational to modern artificial intelligence (AI) systems, including large language models (LLMs), and its influence cannot be overstated. Indeed, many online experiences are shaped by tabular machine-learning models, which recommend products, generate advertisements and moderate social-media content3. Essential industries such as healthcare and finance are also steadily, if cautiously, moving towards increasing their use of AI.

Despite the field’s maturity, Hollmann and colleagues’ advance could be revolutionary. The authors’ contribution is known as a foundation model, which is a general-purpose model that can be used in a range of settings. You might already have encountered foundation models, perhaps unknowingly, through AI tools, such as ChatGPT and Stable Diffusion. These models enable a single tool to offer varied capabilities, including text translation and image generation. So what does a foundation model for tabular machine learning look like?

Let’s return to the hospital example. With spreadsheet in hand, you choose a machine-learning model (such as a neural network) and train the model with your data, using an algorithm that adjusts the model’s parameters to optimize its predictive performance (Fig. 1a). Typically, you would train several such models before selecting one to use — a labour-intensive process that requires considerable time and expertise. And of course, this process must be repeated for each unique task.

Figure 1 | A foundation model for tabular machine learning. a, Conventional machine-learning models are trained on individual data sets using mathematical optimization algorithms. A different model needs to be developed and trained for each task, and for each data set. This practice takes years to learn and requires extensive time and computing resources. b, By contrast, a ‘foundation’ model could be used for any machine-learning task and is pre-trained on the types of data used to train conventional models. This type of model simply reads a data set and can immediately produce inferences about new data points. Hollmann et al. developed a foundation model for tabular machine learning, in which inferences are made on the basis of tables of data. Tabular machine learning is used for tasks as varied as social-media moderation and hospital decision-making, so the authors’ advance is expected to have a profound effect in many areas…(More)”

Participatory seascape mapping: A community-based approach to ocean governance and marine conservation


Paper by Isabel James: “Despite the global proliferation of ocean governance frameworks that feature socioeconomic variables, the inclusion of community needs and local ecological knowledge remains underrepresented. Participatory mapping or Participatory GIS (PGIS) has emerged as a vital method to address this gap by engaging communities that are conventionally excluded from ocean planning and marine conservation. Originally developed for forest management and Indigenous land reclamation, the scholarship on PGIS remains predominantly focused on terrestrial landscapes. This review explores recent research that employs the method in the marine realm, detailing common methodologies, data types and applications in governance and conservation. A typology of ocean-centered PGIS studies was identified, comprising three main categories: fisheries, habitat classification and blue economy activities. Marine Protected Area (MPA) design and conflict management are the most prevalent conservation applications of PGIS. Case studies also demonstrate the method’s effectiveness in identifying critical marine habitats such as fish spawning grounds and monitoring endangered megafauna. Participatory mapping shows particular promise in resource and data limited contexts due to its ability to generate large quantities of relatively reliable, quick and low-cost data. Validation steps, including satellite imagery and ground-truthing, suggest encouraging accuracy of PGIS data, despite potential limitations related to human error and spatial resolution. This review concludes that participatory mapping not only enriches scientific research but also fosters trust and cooperation among stakeholders, ultimately contributing to more resilient and equitable ocean governance…(More)”.

SciAgents: Automating Scientific Discovery Through Bioinspired Multi-Agent Intelligent Graph Reasoning


Paper by Alireza Ghafarollahi, and Markus J. Buehler: “A key challenge in artificial intelligence (AI) is the creation of systems capable of autonomously advancing scientific understanding by exploring novel domains, identifying complex patterns, and uncovering previously unseen connections in vast scientific data. In this work, SciAgents, an approach that leverages three core concepts is presented: (1) large-scale ontological knowledge graphs to organize and interconnect diverse scientific concepts, (2) a suite of large language models (LLMs) and data retrieval tools, and (3) multi-agent systems with in-situ learning capabilities. Applied to biologically inspired materials, SciAgents reveals hidden interdisciplinary relationships that were previously considered unrelated, achieving a scale, precision, and exploratory power that surpasses human research methods. The framework autonomously generates and refines research hypotheses, elucidating underlying mechanisms, design principles, and unexpected material properties. By integrating these capabilities in a modular fashion, the system yields material discoveries, critiques and improves existing hypotheses, retrieves up-to-date data about existing research, and highlights strengths and limitations. This is achieved by harnessing a “swarm of intelligence” similar to biological systems, providing new avenues for discovery. How this model accelerates the development of advanced materials by unlocking Nature’s design principles, resulting in a new biocomposite with enhanced mechanical properties and improved sustainability through energy-efficient production is shown…(More)”.

Academic writing is getting harder to read—the humanities most of all


The Economist: “Academics have long been accused of jargon-filled writing that is impossible to understand. A recent cautionary tale was that of Ally Louks, a researcher who set off a social media storm with an innocuous post on X celebrating the completion of her PhD. If it was Ms Louks’s research topic (“olfactory ethics”—the politics of smell) that caught the attention of online critics, it was her verbose thesis abstract that further provoked their ire. In two weeks, the post received more than 21,000 retweets and 100m views.

Although the abuse directed at Ms Louks reeked of misogyny and anti-intellectualism—which she admirably shook off—the reaction was also a backlash against an academic use of language that is removed from normal life. Inaccessible writing is part of the problem. Research has become harder to read, especially in the humanities and social sciences. Though authors may argue that their work is written for expert audiences, much of the general public suspects that some academics use gobbledygook to disguise the fact that they have nothing useful to say. The trend towards more opaque prose hardly allays this suspicion…(More)”.

Once It Has Been Trained, Who Will Own My Digital Twin?


Article by Todd Carpenter: “Presently, if one ignores the hype around Generative AI systems, we can recognize that software tools are not sentient. Nor can they (yet) overcome the problem of coming up with creative solutions to novel problems. They are limited in what they can do by the training data that they are supplied. They do hold the prospect for making us more efficient and productive, particularly for wrote tasks. But given enough training data, one could consider how much farther this could be taken. In preparation for that future, when it comes to the digital twins, the landscape of the ownership of the intellectual property (IP) behind them is already taking shape.

Several chatbots have been set up to replicate long-dead historical figures so that you can engage with them in their “voice”.  Hellohistory is an AI-driven chatbot that provides people the opportunity to, “have in-depth conversations with history’s greatest.” A different tool, Historical Figures Chat, was widely panned not long after its release in 2023, and especially by historians who strongly objected. There are several variations on this theme of varying quality. Of course, with all things GenAI, they will improve over time and many of the obvious and problematic issues will be resolved either by this generation of companies or the next. Whether there is real value and insight to be gained, apart from the novelty, of engaging with “real historical figures” is the multi-billion dollar question. Much like the World Wide Web in the 1990s, very likely there is value, but it will be years before it can be clearly discerned what that value is and how to capitalize upon it. In anticipation of that day, many organizations are positioning themselves to capture that value.

While many universities have taken a very liberal view of ownership of the intellectual property of their students and faculty — far more liberal than many corporations might — others are quite more restrictive…(More)”.

Collaborative Intelligence


Book edited by Mira Lane and Arathi Sethumadhavan: “…The book delves deeply into the dynamic interplay between theory and practice, shedding light on the transformative potential and complexities of AI. For practitioners deeply immersed in the world of AI, Lane and Sethumadhavan offer firsthand accounts and insights from technologists, academics, and thought leaders, as well as a series of compelling case studies, ranging from AI’s impact on artistry to its role in addressing societal challenges like modern slavery and wildlife conservation.

As the global AI market burgeons, this book enables collaboration, knowledge sharing, and interdisciplinary dialogue. It caters not only to the practitioners shaping the AI landscape but also to policymakers striving to navigate the intricate relationship between humans and machines, as well as academics. Divided into two parts, the first half of the book offers readers a comprehensive understanding of AI’s historical context, its influence on power dynamics, human-AI interaction, and the critical role of audits in governing AI systems. The second half unfolds a series of eight case studies, unraveling AI’s impact on fields as varied as healthcare, vehicular safety, conservation, human rights, and the metaverse. Each chapter in this book paints a vivid picture of AI’s triumphs and challenges, providing a panoramic view of how it is reshaping our world…(More)”

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft


Article by Kate Knibbs: “Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. “It’s gone through rigorous review,” he says…(More)”.

Why probability probably doesn’t exist (but it is useful to act like it does)


Article by David Spiegelhalter: “Life is uncertain. None of us know what is going to happen. We know little of what has happened in the past, or is happening now outside our immediate experience. Uncertainty has been called the ‘conscious awareness of ignorance’1 — be it of the weather tomorrow, the next Premier League champions, the climate in 2100 or the identity of our ancient ancestors.

In daily life, we generally express uncertainty in words, saying an event “could”, “might” or “is likely to” happen (or have happened). But uncertain words can be treacherous. When, in 1961, the newly elected US president John F. Kennedy was informed about a CIA-sponsored plan to invade communist Cuba, he commissioned an appraisal from his military top brass. They concluded that the mission had a 30% chance of success — that is, a 70% chance of failure. In the report that reached the president, this was rendered as “a fair chance”. The Bay of Pigs invasion went ahead, and was a fiasco. There are now established scales for converting words of uncertainty into rough numbers. Anyone in the UK intelligence community using the term ‘likely’, for example, should mean a chance of between 55% and 75% (see go.nature.com/3vhu5zc).

Attempts to put numbers on chance and uncertainty take us into the mathematical realm of probability, which today is used confidently in any number of fields. Open any science journal, for example, and you’ll find papers liberally sprinkled with P values, confidence intervals and possibly Bayesian posterior distributions, all of which are dependent on probability.

And yet, any numerical probability, I will argue — whether in a scientific paper, as part of weather forecasts, predicting the outcome of a sports competition or quantifying a health risk — is not an objective property of the world, but a construction based on personal or collective judgements and (often doubtful) assumptions. Furthermore, in most circumstances, it is not even estimating some underlying ‘true’ quantity. Probability, indeed, can only rarely be said to ‘exist’ at all…(More)”.