
Once It Has Been Trained, Who Will Own My Digital Twin?

Curated on December 23, 2024 by Stefaan Verhulst

Article by Todd Carpenter: “Presently, if one ignores the hype around Generative AI systems, we can recognize that software tools are not sentient. Nor can they (yet) overcome the problem of coming up with creative solutions to novel problems. They are limited in what they can do by the training data that they are supplied. They do hold the prospect for making us more efficient and productive, particularly for rote tasks. But given enough training data, one could consider how much farther this could be taken. In preparation for that future, when it comes to the digital twins, the landscape of the ownership of the intellectual property (IP) behind them is already taking shape.

Several chatbots have been set up to replicate long-dead historical figures so that you can engage with them in their “voice”. Hellohistory is an AI-driven chatbot that provides people the opportunity to, “have in-depth conversations with history’s greatest.” A different tool, Historical Figures Chat, was widely panned not long after its release in 2023, and especially by historians who strongly objected. There are several variations on this theme of varying quality. Of course, with all things GenAI, they will improve over time and many of the obvious and problematic issues will be resolved either by this generation of companies or the next. Whether there is real value and insight to be gained, apart from the novelty, of engaging with “real historical figures” is the multi-billion dollar question. Much like the World Wide Web in the 1990s, very likely there is value, but it will be years before it can be clearly discerned what that value is and how to capitalize upon it. In anticipation of that day, many organizations are positioning themselves to capture that value.

While many universities have taken a very liberal view of ownership of the intellectual property of their students and faculty, far more liberal than many corporations might, others are considerably more restrictive…(More)”.

Collaborative Intelligence

Curated on December 21, 2024 (updated January 2, 2025) by Stefaan Verhulst

Book edited by Mira Lane and Arathi Sethumadhavan: “…The book delves deeply into the dynamic interplay between theory and practice, shedding light on the transformative potential and complexities of AI. For practitioners deeply immersed in the world of AI, Lane and Sethumadhavan offer firsthand accounts and insights from technologists, academics, and thought leaders, as well as a series of compelling case studies, ranging from AI’s impact on artistry to its role in addressing societal challenges like modern slavery and wildlife conservation.

As the global AI market burgeons, this book enables collaboration, knowledge sharing, and interdisciplinary dialogue. It caters not only to the practitioners shaping the AI landscape but also to policymakers striving to navigate the intricate relationship between humans and machines, as well as academics. Divided into two parts, the first half of the book offers readers a comprehensive understanding of AI’s historical context, its influence on power dynamics, human-AI interaction, and the critical role of audits in governing AI systems. The second half unfolds a series of eight case studies, unraveling AI’s impact on fields as varied as healthcare, vehicular safety, conservation, human rights, and the metaverse. Each chapter in this book paints a vivid picture of AI’s triumphs and challenges, providing a panoramic view of how it is reshaping our world…(More)”

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft

Curated on December 18, 2024 by Stefaan Verhulst

Article by Kate Knibbs: “Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. “It’s gone through rigorous review,” he says…(More)”.

Why probability probably doesn’t exist (but it is useful to act like it does)

Curated on December 16, 2024 (updated December 19, 2024) by Stefaan Verhulst

Article by David Spiegelhalter: “Life is uncertain. None of us know what is going to happen. We know little of what has happened in the past, or is happening now outside our immediate experience. Uncertainty has been called the ‘conscious awareness of ignorance’¹ — be it of the weather tomorrow, the next Premier League champions, the climate in 2100 or the identity of our ancient ancestors.

In daily life, we generally express uncertainty in words, saying an event “could”, “might” or “is likely to” happen (or have happened). But uncertain words can be treacherous. When, in 1961, the newly elected US president John F. Kennedy was informed about a CIA-sponsored plan to invade communist Cuba, he commissioned an appraisal from his military top brass. They concluded that the mission had a 30% chance of success — that is, a 70% chance of failure. In the report that reached the president, this was rendered as “a fair chance”. The Bay of Pigs invasion went ahead, and was a fiasco. There are now established scales for converting words of uncertainty into rough numbers. Anyone in the UK intelligence community using the term ‘likely’, for example, should mean a chance of between 55% and 75% (see go.nature.com/3vhu5zc).

Attempts to put numbers on chance and uncertainty take us into the mathematical realm of probability, which today is used confidently in any number of fields. Open any science journal, for example, and you’ll find papers liberally sprinkled with P values, confidence intervals and possibly Bayesian posterior distributions, all of which are dependent on probability.

And yet, any numerical probability, I will argue — whether in a scientific paper, as part of weather forecasts, predicting the outcome of a sports competition or quantifying a health risk — is not an objective property of the world, but a construction based on personal or collective judgements and (often doubtful) assumptions. Furthermore, in most circumstances, it is not even estimating some underlying ‘true’ quantity. Probability, indeed, can only rarely be said to ‘exist’ at all…(More)”.
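The word-to-number scale mentioned in the excerpt above can be sketched as a small lookup table. This is only an illustration: the names `VERBAL_SCALE`, `band` and `describe` are hypothetical, and the only band filled in is the one the excerpt states (“likely” meaning a chance of between 55% and 75%). Other terms would have to come from the published scale itself.

```python
# Hypothetical lookup table mapping words of uncertainty to probability bands.
# Only the "likely" band is taken from the excerpt; nothing else is assumed.
VERBAL_SCALE = {
    "likely": (0.55, 0.75),
}


def band(term):
    """Return the (low, high) band for a term, or None if it is not defined."""
    return VERBAL_SCALE.get(term.lower())


def describe(probability):
    """Return every defined term whose band contains the given probability."""
    return [
        term
        for term, (low, high) in VERBAL_SCALE.items()
        if low <= probability <= high
    ]


# The Bay of Pigs appraisal from the excerpt: a 30% chance of success.
chance_of_success = 0.30
chance_of_failure = 1 - chance_of_success

print(describe(chance_of_success))  # success does not fall in the "likely" band
print(describe(chance_of_failure))  # failure does
```

Under this scale, the mission's failure, not its success, is what would have been reported as “likely”, which is the gap that the phrase “a fair chance” concealed.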

The AI revolution is running out of data. What can researchers do?

Curated on December 11, 2024 by Stefaan Verhulst

Article by Nicola Jones: “The Internet is a vast ocean of human knowledge, but it isn’t infinite. And artificial intelligence (AI) researchers have nearly sucked it dry.

The past decade of explosive improvement in AI has been driven in large part by making neural networks bigger and training them on ever-more data. This scaling has proved surprisingly effective at making large language models (LLMs) — such as those that power the chatbot ChatGPT — both more capable of replicating conversational language and of developing emergent properties such as reasoning. But some specialists say that we are now approaching the limits of scaling. That’s in part because of the ballooning energy requirements for computing. But it’s also because LLM developers are running out of the conventional data sets used to train their models.

A prominent study¹ made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets.

The imminent bottleneck in training data could be starting to pinch. “I strongly suspect that’s already happening,” says Longpre…(More)”

Running out of data: Chart showing projections of the amount of text data used to train large language models and the amount of available text on the Internet, suggesting that by 2028, developers will be using data sets that match the total amount of text that is available.

Rethinking Theories of Governance

Curated on December 7, 2024 (updated December 5, 2024) by Stefaan Verhulst

Book by Christopher Ansell: “Are theories of governance useful for helping policymakers and citizens meet and tackle contemporary challenges? This insightful book reflects on how a theory becomes useful and evaluates a range of theories according to whether they are warranted, diagnostic, and dialogical.

By arguing that useful theory tells us what to ask, not what to do, Christopher Ansell investigates what it means for a theory to be useful. Analysing how governance theories address a variety of specific challenges, chapters examine intractable public problems, weak government accountability, violent conflict, global gridlock, poverty and the unsustainable exploitation of our natural resources. Finding significant tensions between state- and society-centric perspectives on governance, the book concludes with a suggestion that we refocus our theories of governance on possibilities for state-society synergy. Governance theories of the future, Ansell argues, should also strive for a more fruitful dialogue between instrumental, critical and explanatory perspectives.

Examining both the conceptual and empirical basis of theories of governance, this comprehensive book will be an invigorating read for scholars and students in the fields of public administration, public policy and planning, development studies, political science and urban, environmental and global governance. By linking theories of governance to concrete societal challenges, it will also be of use to policymakers and practitioners concerned with these fields…(More)”.

An Open Source Python Library for Anonymizing Sensitive Data

Curated on December 4, 2024 (updated December 5, 2024) by Stefaan Verhulst

Paper by Judith Sáinz-Pardo Díaz & Álvaro López García: “Open science is a fundamental pillar to promote scientific progress and collaboration, based on the principles of open data, open source and open access. However, the requirements for publishing and sharing open data are in many cases difficult to meet in compliance with strict data protection regulations. Consequently, researchers need to rely on proven methods that allow them to anonymize their data without sharing it with third parties. To this end, this paper presents the implementation of a Python library for the anonymization of sensitive tabular data. This framework provides users with a wide range of anonymization methods that can be applied on the given dataset, including the set of identifiers, quasi-identifiers, generalization hierarchies and allowed level of suppression, along with the sensitive attribute and the level of anonymity required. The library has been implemented following best practices for integration and continuous development, as well as the use of workflows to test code coverage based on unit and functional tests…(More)”.
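To make the vocabulary in the excerpt concrete (quasi-identifiers, generalization, a required level of anonymity), here is a minimal sketch in plain Python. It is not the API of the library the paper describes. It assumes k-anonymity as the anonymity notion, which the excerpt does not name, and the helper names `k_anonymity` and `generalize_age` and the sample records are all hypothetical.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return k: the size of the smallest group of records that share
    the same values on every quasi-identifier (0 for an empty table)."""
    groups = Counter(
        tuple(record[q] for q in quasi_identifiers) for record in records
    )
    return min(groups.values()) if groups else 0


def generalize_age(age, width=10):
    """One level of a generalization hierarchy: replace an exact age
    with a range such as '20-29'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


# Made-up sample data: 'age' and 'zip' are treated as quasi-identifiers,
# 'diagnosis' as the sensitive attribute.
records = [
    {"age": 23, "zip": "39001", "diagnosis": "flu"},
    {"age": 27, "zip": "39001", "diagnosis": "asthma"},
    {"age": 34, "zip": "39002", "diagnosis": "flu"},
    {"age": 38, "zip": "39002", "diagnosis": "diabetes"},
]

quasi_identifiers = ["age", "zip"]
k_before = k_anonymity(records, quasi_identifiers)

# Generalize the age column, leaving the original records untouched.
generalized = [{**r, "age": generalize_age(r["age"])} for r in records]
k_after = k_anonymity(generalized, quasi_identifiers)

print(k_before, k_after)
```

With exact ages every record is unique on its quasi-identifiers; after generalizing ages into ten-year ranges, each record shares its quasi-identifier values with one other record.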

Civic Engagement & Policymaking Toolkit

Curated on December 4, 2024 (updated December 5, 2024) by Stefaan Verhulst

About: “This toolkit serves as a guide for science centers and museums and other science engagement organizations to thoughtfully identify and implement ways to nurture civic experiences like these across their work or deepen ongoing civic initiatives for meaningful change within their communities…

This toolkit outlines a Community Science Approach, Civic Engagement & Policymaking, where science and technology are factors in collective civic action and policy decisions to meet community goals. It includes:

Guidance for your team on how to get started with this work,
An overview of what Civic Engagement & Policymaking as a Community Science Approach can entail,
Descriptions of four roles your organization can play to authentically engage with communities on civic priorities,
Examples of real collaborations between science engagement organizations and their partners that advance community priorities,
Tools, guides, and other resources to help you prepare for new civic engagement efforts and/or expand or deepen existing civic engagement efforts…(More)”.

Garden city: A synthetic dataset and sandbox environment for analysis of pre-processing algorithms for GPS human mobility data

Curated on December 4, 2024 by Stefaan Verhulst

Paper by Thomas H. Li and Francisco Barreras: “Human mobility datasets have seen increasing adoption in the past decade, enabling diverse applications that leverage the high precision of measured trajectories relative to other human mobility datasets. However, there are concerns about whether the high sparsity in some commercial datasets can introduce errors due to lack of robustness in processing algorithms, which could compromise the validity of downstream results. The scarcity of “ground-truth” data makes it particularly challenging to evaluate and calibrate these algorithms. To overcome these limitations and allow for an intermediate form of validation of common processing algorithms, we propose a synthetic trajectory simulator and sandbox environment meant to replicate the features of commercial datasets that could cause errors in such algorithms, and which can be used to compare algorithm outputs with “ground-truth” synthetic trajectories and mobility diaries. Our code is open-source and is publicly available alongside tutorial notebooks and sample datasets generated with it…(More)”
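The validation idea in the excerpt (run an algorithm on sparse data and compare its output with a synthetic ground truth) can be illustrated with a toy example. This is not the authors' simulator: the square-loop trajectory, the every-nth-point thinning, and the choice of path length as the "processing algorithm" are all assumptions made for illustration.

```python
import math


def square_loop(side=10):
    """Synthetic ground truth: one lap around a square, one point per unit step,
    starting and ending at the origin."""
    points = [(x, 0) for x in range(side)]
    points += [(side, y) for y in range(side)]
    points += [(x, side) for x in range(side, 0, -1)]
    points += [(0, y) for y in range(side, -1, -1)]
    return points


def sparsify(points, keep_every):
    """Crude stand-in for a sparse dataset: keep every nth observation."""
    return points[::keep_every]


def path_length(points):
    """The 'processing algorithm' under test: total distance travelled,
    summed over straight segments between consecutive observations."""
    return sum(math.dist(a, b) for a, b in zip(points, points[1:]))


truth = square_loop(side=10)
sparse = sparsify(truth, keep_every=7)

true_length = path_length(truth)
estimated_length = path_length(sparse)

# Relative error introduced purely by the loss of observations.
relative_error = (true_length - estimated_length) / true_length
print(true_length, estimated_length, relative_error)
```

Because straight segments between the remaining points cut the corners of the loop, the sparse estimate can never exceed the true distance; having the ground truth available is what makes the size of that error measurable.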

Scientists Scramble to Save Climate Data from Trump—Again

Curated on December 1, 2024 (updated December 5, 2024) by Stefaan Verhulst

Article by Chelsea Harvey: “Eight years ago, as the Trump administration was getting ready to take office for the first time, mathematician John Baez was making his own preparations.

Together with a small group of friends and colleagues, he was arranging to download large quantities of public climate data from federal websites in order to safely store them away. Then-President-elect Donald Trump had repeatedly denied the basic science of climate change and had begun nominating climate skeptics for cabinet posts. Baez, a professor at the University of California, Riverside, was worried the information — everything from satellite data on global temperatures to ocean measurements of sea-level rise — might soon be destroyed.

His effort, known as the Azimuth Climate Data Backup Project, archived at least 30 terabytes of federal climate data by the end of 2017.

In the end, it was an overprecaution.

The first Trump administration altered or deleted numerous federal web pages containing public-facing climate information, according to monitoring efforts by the nonprofit Environmental Data and Governance Initiative (EDGI), which tracks changes on federal websites. But federal databases, containing vast stores of globally valuable climate information, remained largely intact through the end of Trump’s first term.

Yet as Trump prepares to take office again, scientists are growing more worried.

Federal datasets may be in bigger trouble this time than they were under the first Trump administration, they say. And they’re preparing to begin their archiving efforts anew.

“This time around we expect them to be much more strategic,” said Gretchen Gehrke, EDGI’s website monitoring program lead. “My guess is that they’ve learned their lessons.”

The Trump transition team didn’t respond to a request for comment.

Like Baez’s Azimuth project, EDGI was born in 2016 in response to Trump’s first election. They weren’t the only ones…(More)”.