Commons-based Data Set Governance for AI


Report by Open Future: “In this white paper, we propose an approach to sharing data sets for AI training as a public good governed as a commons. By adhering to the six principles of commons-based governance, data sets can be managed in a way that generates public value while making shared resources resilient to extraction or capture by commercial interests.

The purpose of defining these principles is two-fold:

First, we propose these principles as input into policy debates on data and AI governance. A commons-based approach can be introduced through regulatory means, funding and procurement rules, statements of principles, or data sharing frameworks. Secondly, these principles can also serve as a blueprint for the design of data sets that are governed and shared as a commons. To this end, we also provide practical examples of how these principles are being brought to life. Projects like Big Science or Common Voice have demonstrated that commons-based data sets can be successfully built.

These principles, tailored for the governance of AI data sets, are built on our previous work on the Data Commons Primer. They are also the outcome of our research into the governance of AI datasets, including the AI_Commons case study. Finally, they are based on ongoing efforts to define how AI systems can be shared and made open, in which we have been participating – including the OSI-led process to define open-source AI systems, and the DPGA Community of Practice exploring AI systems as Digital Public Goods…(More)”.

The six principles for commons-based data set governance are as follows:

Using online search activity for earlier detection of gynaecological malignancy


Paper by Jennifer F. Barcroft et al: “Ovarian cancer is the most lethal and endometrial cancer the most common gynaecological cancer in the UK, yet neither has a screening program in place to facilitate early disease detection. The aim is to evaluate whether online search data can be used to differentiate between individuals with malignant and benign gynaecological diagnoses.

This is a prospective cohort study evaluating online search data in symptomatic individuals (Google users) referred from primary care (GP) with a suspected cancer to a London Hospital (UK) between December 2020 and June 2022. Informed written consent was obtained and online search data was extracted via Google takeout and anonymised. A health filter was applied to extract health-related terms for 24 months prior to GP referral. A predictive model (outcome: malignancy) was developed using (1) search queries (terms model) and (2) categorised search queries (categories model). Area under the ROC curve (AUC) was used to evaluate model performance. 844 women were approached, 652 were eligible to participate and 392 were recruited. Of those recruited, 108 did not complete enrollment, 12 withdrew and 37 were excluded as they did not track Google searches or had an empty search history, leaving a cohort of 235.
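The recruitment figures quoted above can be checked with a few lines of arithmetic. This is only an illustrative sketch: the variable names are hypothetical and the numbers are copied from the abstract, not from the study's data.

```python
# Recruitment flow as reported in the abstract (variable names are illustrative).
approached = 844
eligible = 652
recruited = 392

# Reasons recruited participants did not reach the final cohort.
incomplete_enrollment = 108
withdrew = 12
excluded_no_search_history = 37

final_cohort = recruited - incomplete_enrollment - withdrew - excluded_no_search_history

# Share of eligible women who were recruited, as a percentage.
recruited_share_of_eligible = round(100 * recruited / eligible, 1)

print(final_cohort)                 # 235, matching the reported cohort size
print(recruited_share_of_eligible)  # 60.1
```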

The cohort had a median age of 53 years old (range 20–81) and a malignancy rate of 26.0%. There was a difference in online search data between those with a benign and malignant diagnosis, noted as early as 360 days in advance of GP referral, when search queries were used directly, but only 60 days in advance, when queries were divided into health categories. A model using online search data from patients (n = 153) who performed health-related searches, corrected for sample size, achieved its highest sample-corrected AUC of 0.82 at 60 days prior to GP referral.
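For readers unfamiliar with the AUC metric mentioned above, here is a minimal sketch of how it can be computed. It uses the pairwise-ranking definition (the probability that a randomly chosen malignant case receives a higher model score than a randomly chosen benign case). The function name, labels and scores are hypothetical toy values, not the study's data or code, and the sample-size correction the authors describe is not reproduced here.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for the positive class (e.g. malignant), 0 for the negative class.
    scores: model scores, higher meaning more likely positive.
    Ties between a positive and a negative score count as half a correct pair.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")

    correct = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                correct += 1.0
            elif p == n:
                correct += 0.5
    return correct / (len(positives) * len(negatives))


# Hypothetical toy data: three positive and three negative cases.
toy_labels = [1, 1, 0, 0, 0, 1]
toy_scores = [0.9, 0.4, 0.35, 0.5, 0.1, 0.8]

toy_auc = roc_auc(toy_labels, toy_scores)
print(round(toy_auc, 3))  # 0.889 (8 of 9 pairs ranked correctly)
```

An AUC of 0.5 corresponds to chance-level ranking and 1.0 to perfect separation, which gives some context for the reported value of 0.82.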

Online search data appears to be different between individuals with malignant and benign gynaecological conditions, with a signal observed in advance of GP referral date. Online search data needs to be evaluated in a larger dataset to determine its value as an early disease detection tool and whether its use leads to improved clinical outcomes…(More)”.

Responsible Data Re-use in Developing Countries: Social Licence through Public Engagement


Report by Stefaan Verhulst, Laura Sandor, Natalia Mejia Pardo, Elena Murray and Peter Addo: “The datafication era has transformed the technological landscape, digitizing multiple areas of human life and offering opportunities for societal progress through the re-use of digital data. Developing countries stand to benefit from datafication but are faced with challenges like insufficient data quality and limited infrastructure. One of the primary obstacles to unlocking data re-use lies in agency asymmetries—disparities in decision-making authority among stakeholders—which fuel public distrust. Existing consent frameworks amplify the challenge, as they are individual-focused, lack information, and fail to address the nuances of data re-use. To address these limitations, a Social License for re-use becomes imperative—a community-focused approach that fosters responsible data practices and benefits all stakeholders. This shift is crucial for establishing trust and collaboration, and bridging the gap between institutions, governments, and citizens…(More)”.

Central banks use AI to assess climate-related risks


Article by Huw Jones: “Central bankers said on Tuesday they have broken new ground by using artificial intelligence to collect data for assessing climate-related financial risks, just as the volume of disclosures from banks and other companies is set to rise.

The Bank for International Settlements, a forum for central banks, the Bank of Spain, Germany’s Bundesbank and the European Central Bank said their experimental Gaia AI project was used to analyse company disclosures on carbon emissions, green bond issuance and voluntary net-zero commitments.

Regulators of banks, insurers and asset managers need high-quality data to assess the impact of climate change on financial institutions. However, the absence of a single reporting standard confronts them with a patchwork of public information spread across text, tables and footnotes in annual reports.

Gaia was able to overcome differences in definitions and disclosure frameworks across jurisdictions to offer much-needed transparency, and make it easier to compare indicators on climate-related financial risks, the central banks said in a joint statement.

Despite variations in how the same data is reported by companies, Gaia focuses on the definition of each indicator, rather than how the data is labelled.

Furthermore, with the traditional approach, each additional key performance indicator, or KPI, and each new institution requires the analyst to either search for the information in public corporate reports or contact the institution for information…(More)”.

Data Disquiet: Concerns about the Governance of Data for Generative AI


Paper by Susan Aaronson: “The growing popularity of large language models (LLMs) has raised concerns about their accuracy. These chatbots can be used to provide information, but it may be tainted by errors or made-up or false information (hallucinations) caused by problematic data sets or incorrect assumptions made by the model. The questionable results produced by chatbots have led to growing disquiet among users, developers and policy makers. The author argues that policy makers need to develop a systemic approach to address these concerns. The current piecemeal approach does not reflect the complexity of LLMs or the magnitude of the data upon which they are based; therefore, the author recommends incentivizing greater transparency and accountability around data-set development…(More)”.

God-like: A 500-Year History of Artificial Intelligence in Myths, Machines, Monsters


Book by Kester Brewin: “In the year 1600 a monk is burned at the stake for claiming to have built a device that will allow him to know all things.

350 years later, having witnessed ‘Trinity’ – the first test of the atomic bomb – America’s leading scientist outlines a memory machine that will help end war on earth.

25 years in the making, an ex-soldier finally unveils this ‘machine for augmenting human intellect’, dazzling as he stands ‘Zeus-like, dealing lightning with both hands.’

AI is both stunningly new and rooted in ancient desires. As we finally welcome this ‘god-like’ technology amongst us, what can we learn from the myths and monsters of the past about how to survive alongside our greatest ever invention?…(More)”.

Meta to shut off data access to journalists


Article by Sara Fischer: “Meta plans to officially shutter CrowdTangle, the analytics tool widely used by journalists and researchers to see what’s going viral on Facebook and Instagram, the company’s president of global affairs Nick Clegg told Axios in an interview.

Why it matters: The company plans to instead offer select researchers access to a set of new data tools, but news publishers, journalists or anyone with commercial interests will not be granted access to that data.

The big picture: The effort comes amid a broader pivot from Meta away from news and politics and more toward user-generated viral videos.

  • Meta acquired CrowdTangle in 2016 at a time when publishers were heavily reliant on the tech giant for traffic.
  • In recent years, it’s stopped investing in the tool, making it less reliable.

The new research tools include Meta’s Content Library, which it launched last year, and an API, or backend interface used by developers.

  • Both tools offer researchers access to huge swaths of data from publicly accessible content across Facebook and Instagram.
  • The tools are available in 180 languages and offer global data.
  • Researchers must apply for access to those tools through the Inter-university Consortium for Political and Social Research at the University of Michigan, which will vet their requests…(More)”.

A typology of artificial intelligence data work


Article by James Muldoon et al: “This article provides a new typology for understanding human labour integrated into the production of artificial intelligence systems through data preparation and model evaluation. We call these forms of labour ‘AI data work’ and show how they are an important and necessary element of the artificial intelligence production process. We draw on fieldwork with an artificial intelligence data business process outsourcing centre specialising in computer vision data, alongside a decade of fieldwork with microwork platforms, business process outsourcing, and artificial intelligence companies to help dispel confusion around the multiple concepts and frames that encompass artificial intelligence data work including ‘ghost work’, ‘microwork’, ‘crowdwork’ and ‘cloudwork’. We argue that these different frames of reference obscure important differences between how this labour is organised in different contexts. The article provides a conceptual division between the different types of artificial intelligence data work institutions and the different stages of what we call the artificial intelligence data pipeline. This article thus contributes to our understanding of how the practices of workers become a valuable commodity integrated into global artificial intelligence production networks…(More)”.

Riders in the smog


Article by Zuha Siddiqui, Samriddhi Sakuna and Faisal Mahmud: “…To better understand air quality exposure among gig workers in South Asia, Rest of World gave three gig workers — one each in Lahore, New Delhi, and Dhaka — air quality monitors to wear throughout a regular shift in January. The Atmotube Pro monitors continually tracked their exposure to carcinogenic pollutants — specifically PM1, PM2.5, and PM10 (different sizes of particulate matter), and volatile organic compounds such as benzene and formaldehyde.

The data revealed that all three workers were routinely exposed to hazardous levels of pollutants. For PM2.5, referring to particulates that are 2.5 micrometers in diameter or less — which have been linked to health risks including heart attacks and strokes — all riders were consistently logging exposure levels more than 10 times the World Health Organization’s recommended daily average of 15 micrograms per cubic meter. Manu Sharma, in New Delhi, recorded the highest PM2.5 level of the three riders, hitting 468.3 micrograms per cubic meter around 6 p.m. Lahore was a close second, with Iqbal recording 464.2 micrograms per cubic meter around the same time.
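To put the peak readings above in proportion, the short sketch below divides each reported PM2.5 peak by the WHO daily guideline quoted in the article. The variable names are illustrative, the two peak values are the only ones the excerpt gives (no peak is quoted for Dhaka), and a single peak reading is not the same thing as a daily average.

```python
# WHO recommended daily average for PM2.5, as quoted in the article (µg/m³).
WHO_PM25_DAILY_GUIDELINE = 15.0

# Peak PM2.5 readings reported in the excerpt (µg/m³).
peak_pm25 = {
    "New Delhi": 468.3,
    "Lahore": 464.2,
}

# How many times each peak exceeds the guideline, rounded to one decimal place.
times_over_guideline = {
    city: round(reading / WHO_PM25_DAILY_GUIDELINE, 1)
    for city, reading in peak_pm25.items()
}

for city, ratio in times_over_guideline.items():
    print(f"{city}: {ratio}x the WHO daily guideline")
# New Delhi: 31.2x the WHO daily guideline
# Lahore: 30.9x the WHO daily guideline
```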

Alongside tracking specific pollutants, the Atmotube Pro gives an overall real-time air quality score (AQS) from 0–100, with zero being the most severely polluted, and 100 being the cleanest. According to Atmo, the company that makes the Atmotube monitors, a reading of 0–20 should be considered a health alert, under which conditions “everyone should avoid all outdoor exertion.” But the three gig workers found their monitors consistently displayed the lowest possible score…(More)”.

The New Fire: War, Peace, and Democracy in the Age of AI


Book by Ben Buchanan and Andrew Imbrie: “Artificial intelligence is revolutionizing the modern world. It is ubiquitous—in our homes and offices, in the present and most certainly in the future. Today, we encounter AI as our distant ancestors once encountered fire. If we manage AI well, it will become a force for good, lighting the way to many transformative inventions. If we deploy it thoughtlessly, it will advance beyond our control. If we wield it for destruction, it will fan the flames of a new kind of war, one that holds democracy in the balance. As AI policy experts Ben Buchanan and Andrew Imbrie show in The New Fire, few choices are more urgent—or more fascinating—than how we harness this technology and for what purpose.

The new fire has three sparks: data, algorithms, and computing power. These components fuel viral disinformation campaigns, new hacking tools, and military weapons that once seemed like science fiction. To autocrats, AI offers the prospect of centralized control at home and asymmetric advantages in combat. It is easy to assume that democracies, bound by ethical constraints and disjointed in their approach, will be unable to keep up. But such a dystopia is hardly preordained. Combining an incisive understanding of technology with shrewd geopolitical analysis, Buchanan and Imbrie show how AI can work for democracy. With the right approach, technology need not favor tyranny…(More)”.