Monitoring global trade using data on vessel traffic


Article by Graham Pilgrim, Emmanuelle Guidetti and Annabelle Mourougane: “Rising uncertainties and geo-political tensions, together with more complex trade relations, have increased the demand for data and tools to monitor global trade in a timely manner. At the same time, advances in Big Data Analytics and access to a huge quantity of alternative data – outside the realm of official statistics – have opened new avenues to monitor trade. These data can help identify bottlenecks and disruptions in real time but need to be cleaned and validated.

One such alternative data source is the Automatic Identification System (AIS), developed by the International Maritime Organisation, facilitating the tracking of vessels across the globe. The system includes messages transmitted by ships to land or satellite receivers, available in quasi real time. While it was primarily designed to ensure vessel safety, this data is particularly well suited for providing insights on trade developments, as over 80% of international merchandise trade by volume is carried by sea (UNCTAD, 2022). Furthermore, AIS data holds granular vessel information and detailed location data, which, combined with other data sources, can enable the identification of activity at a port (or even berth) level, by vessel type or by the jurisdiction of vessel ownership.

For a number of years, the UN Global Platform has made AIS data available to those compiling official statistics, such as National Statistics Offices (NSOs) or International Organisations. This has facilitated the development of new methodologies, for instance the automated identification of port locations (Irish Central Statistics Office, 2022). The data has also been exploited by data scientists and research centres to monitor trade in specific commodities such as Liquefied Natural Gas (QuantCube Technology, 2022) or to analyse port and shipping operations in a specific country (Tsalamanis et al., 2018). Beyond trade, the dataset has been used to track CO2 emissions from the maritime sector (Clarke et al., 2023).

New work from the OECD Statistics and Data Directorate contributes to existing research in this field in two major ways. First, it proposes a new methodology to identify ports, at a higher level of precision than in past research. Second, it builds indicators to monitor port congestion and trends in maritime trade flows and provides a tool to get detailed information and better understand those flows…(More)”.
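The port-level activity detection described above can be illustrated with a toy example. The snippet below is not the OECD methodology; it simply assumes that a vessel broadcasting a low speed over ground inside a known port boundary for a sustained period counts as a port call. The `AisPosition` fields, the port bounding box and the thresholds are all hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical, simplified AIS position report (real AIS messages carry many more fields).
@dataclass
class AisPosition:
    mmsi: int            # vessel identifier
    timestamp: datetime
    lat: float
    lon: float
    speed_knots: float   # speed over ground

# Illustrative port bounding box (lat_min, lat_max, lon_min, lon_max); real work would
# use precise port or berth polygons, as in the port-identification research cited above.
PORTS = {
    "Port A": (51.90, 51.98, 4.00, 4.20),
}

def detect_port_calls(track, max_speed=1.0, min_dwell=timedelta(hours=2)):
    """Flag a port call when a vessel stays below `max_speed` knots inside a port
    box for at least `min_dwell`. All thresholds are illustrative only."""
    calls = []
    for port, (lat0, lat1, lon0, lon1) in PORTS.items():
        inside = [p for p in sorted(track, key=lambda p: p.timestamp)
                  if lat0 <= p.lat <= lat1 and lon0 <= p.lon <= lon1
                  and p.speed_knots <= max_speed]
        if inside and inside[-1].timestamp - inside[0].timestamp >= min_dwell:
            calls.append((port, inside[0].timestamp, inside[-1].timestamp))
    return calls
```

Counting such calls per port and per day, or measuring how long vessels idle before berthing, is one simple way congestion-style indicators could be derived from data of this kind.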

Commons-based Data Set: Governance for AI


Report by Open Future: “In this white paper, we propose an approach to sharing data sets for AI training as a public good governed as a commons. By adhering to the six principles of commons-based governance, data sets can be managed in a way that generates public value while making shared resources resilient to extraction or capture by commercial interests.

The purpose of defining these principles is two-fold:

First, we propose these principles as input into policy debates on data and AI governance. A commons-based approach can be introduced through regulatory means, funding and procurement rules, statements of principles, or data sharing frameworks. Second, these principles can also serve as a blueprint for the design of data sets that are governed and shared as a commons. To this end, we also provide practical examples of how these principles are being brought to life. Projects like Big Science or Common Voice have demonstrated that commons-based data sets can be successfully built.

These principles, tailored for the governance of AI data sets, are built on our previous work on Data Commons Primer. They are also the outcome of our research into the governance of AI datasets, including the AI_Commons case study.  Finally, they are based on ongoing efforts to define how AI systems can be shared and made open, in which we have been participating – including the OSI-led process to define open-source AI systems, and the DPGA Community of Practice exploring AI systems as Digital Public Goods…(More)”.

Using online search activity for earlier detection of gynaecological malignancy


Paper by Jennifer F. Barcroft et al: “Ovarian cancer is the most lethal and endometrial cancer the most common gynaecological cancer in the UK, yet neither has a screening program in place to facilitate early disease detection. The aim is to evaluate whether online search data can be used to differentiate between individuals with malignant and benign gynaecological diagnoses.

This is a prospective cohort study evaluating online search data in symptomatic individuals (Google users) referred from primary care (GP) with a suspected cancer to a London hospital (UK) between December 2020 and June 2022. Informed written consent was obtained and online search data was extracted via Google Takeout and anonymised. A health filter was applied to extract health-related terms for 24 months prior to GP referral. A predictive model (outcome: malignancy) was developed using (1) search queries (terms model) and (2) categorised search queries (categories model). Area under the ROC curve (AUC) was used to evaluate model performance. 844 women were approached, 652 were eligible to participate and 392 were recruited. Of those recruited, 108 did not complete enrollment, 12 withdrew and 37 were excluded as they did not track Google searches or had an empty search history, leaving a cohort of 235.

The cohort had a median age of 53 years (range 20–81) and a malignancy rate of 26.0%. There was a difference in online search data between those with a benign and malignant diagnosis, noted as early as 360 days in advance of GP referral when search queries were used directly, but only 60 days in advance when queries were divided into health categories. A model using online search data from patients (n = 153) who performed health-related searches achieved its highest sample-corrected AUC of 0.82, 60 days prior to GP referral.
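For readers less familiar with the evaluation metric, the sketch below shows how an AUC of this kind is typically computed with scikit-learn. The feature matrix and labels are invented placeholders (random counts of health-related search terms); this is not the authors' model, only an illustration of scoring a terms-based classifier by the area under the ROC curve.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Placeholder data: rows are participants, columns are counts of health-related search
# terms in a window before referral; labels are 1 = malignant, 0 = benign.
X = rng.poisson(1.0, size=(235, 50)).astype(float)
y = rng.binomial(1, 0.26, size=235)  # roughly the 26% malignancy rate reported above

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]

# An AUC of 0.5 means no discrimination; 1.0 means perfect separation of benign and malignant.
print(f"AUC: {roc_auc_score(y_test, scores):.2f}")
```

With random placeholder labels the score hovers around 0.5; the study's reported 0.82, by contrast, indicates a substantial signal in the real search histories.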

Online search data appears to be different between individuals with malignant and benign gynaecological conditions, with a signal observed in advance of GP referral date. Online search data needs to be evaluated in a larger dataset to determine its value as an early disease detection tool and whether its use leads to improved clinical outcomes…(More)”.

Responsible Data Re-use in Developing Countries: Social Licence through Public Engagement


Report by Stefaan Verhulst, Laura Sandor, Natalia Mejia Pardo, Elena Murray and Peter Addo: “The datafication era has transformed the technological landscape, digitizing multiple areas of human life and offering opportunities for societal progress through the re-use of digital data. Developing countries stand to benefit from datafication but are faced with challenges like insufficient data quality and limited infrastructure. One of the primary obstacles to unlocking data re-use lies in agency asymmetries—disparities in decision-making authority among stakeholders—which fuel public distrust. Existing consent frameworks amplify the challenge, as they are individual-focused, lack information, and fail to address the nuances of data re-use. To address these limitations, a Social License for re-use becomes imperative—a community-focused approach that fosters responsible data practices and benefits all stakeholders. This shift is crucial for establishing trust and collaboration, and bridging the gap between institutions, governments, and citizens…(More)”.

Central banks use AI to assess climate-related risks


Article by Huw Jones: “Central bankers said on Tuesday they have broken new ground by using artificial intelligence to collect data for assessing climate-related financial risks, just as the volume of disclosures from banks and other companies is set to rise.

The Bank for International Settlements, a forum for central banks, the Bank of Spain, Germany’s Bundesbank and the European Central Bank said their experimental Gaia AI project was used to analyse company disclosures on carbon emissions, green bond issuance and voluntary net-zero commitments.

Regulators of banks, insurers and asset managers need high-quality data to assess the impact of climate-change on financial institutions. However, the absence of a single reporting standard confronts them with a patchwork of public information spread across text, tables and footnotes in annual reports.

Gaia was able to overcome differences in definitions and disclosure frameworks across jurisdictions to offer much-needed transparency, and make it easier to compare indicators on climate-related financial risks, the central banks said in a joint statement.

Despite variations in how the same data is reported by companies, Gaia focuses on the definition of each indicator, rather than how the data is labelled.
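The excerpt does not describe Gaia's internals, but the definition-driven idea can be caricatured with a toy sketch: each indicator gets its own definition and matching patterns, and reports are searched against those rather than against any one company's labels. The indicator names and regular expressions below are assumptions for illustration only; a system like Gaia would use far more sophisticated, AI-based extraction.

```python
import re

# Hypothetical indicator definitions: extraction keys on what the indicator means,
# not on the label a particular company happens to use in its report.
INDICATORS = {
    "scope_1_emissions_tco2e": [
        r"scope\s*1\s+emissions\D{0,40}?([\d,\.]+)",
        r"direct\s+ghg\s+emissions\D{0,40}?([\d,\.]+)",
    ],
    "green_bond_issuance": [
        r"green\s+bonds?\s+issued\D{0,40}?([\d,\.]+)",
    ],
}

def extract_indicators(report_text: str) -> dict:
    """Return the first numeric match for each indicator, whatever wording the report uses."""
    text = report_text.lower()
    results = {}
    for indicator, patterns in INDICATORS.items():
        for pattern in patterns:
            match = re.search(pattern, text)
            if match:
                results[indicator] = float(match.group(1).replace(",", ""))
                break
    return results

print(extract_indicators("Direct GHG emissions: 12,400 tCO2e; green bonds issued: 350 million USD."))
```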

Furthermore, with the traditional approach, each additional key performance indicator, or KPI, and each new institution requires the analyst to either search for the information in public corporate reports or contact the institution for information…(More)”.

Data Disquiet: Concerns about the Governance of Data for Generative AI


Paper by Susan Aaronson: “The growing popularity of large language models (LLMs) has raised concerns about their accuracy. These chatbots can be used to provide information, but it may be tainted by errors or made-up or false information (hallucinations) caused by problematic data sets or incorrect assumptions made by the model. The questionable results produced by chatbots have led to growing disquiet among users, developers and policy makers. The author argues that policy makers need to develop a systemic approach to address these concerns. The current piecemeal approach does not reflect the complexity of LLMs or the magnitude of the data upon which they are based; therefore, the author recommends incentivizing greater transparency and accountability around data-set development…(More)”.

God-like: A 500-Year History of Artificial Intelligence in Myths, Machines, Monsters


Book by Kester Brewin: “In the year 1600 a monk is burned at the stake for claiming to have built a device that will allow him to know all things.

350 years later, having witnessed ‘Trinity’ – the first test of the atomic bomb – America’s leading scientist outlines a memory machine that will help end war on earth.

25 years in the making, an ex-soldier finally unveils this ‘machine for augmenting human intellect’, dazzling as he stands ‘Zeus-like, dealing lightning with both hands.’

AI is both stunningly new and rooted in ancient desires. As we finally welcome this ‘god-like’ technology amongst us, what can we learn from the myths and monsters of the past about how to survive alongside our greatest ever invention?…(More)”.

Meta to shut off data access to journalists


Article by Sara Fischer: “Meta plans to officially shutter CrowdTangle, the analytics tool widely used by journalists and researchers to see what’s going viral on Facebook and Instagram, the company’s president of global affairs Nick Clegg told Axios in an interview.

Why it matters: The company plans to instead offer select researchers access to a set of new data tools, but news publishers, journalists or anyone with commercial interests will not be granted access to that data.

The big picture: The effort comes amid a broader pivot by Meta away from news and politics and toward user-generated viral videos.

  • Meta acquired CrowdTangle in 2016 at a time when publishers were heavily reliant on the tech giant for traffic.
  • In recent years, it’s stopped investing in the tool, making it less reliable.

The new research tools include Meta’s Content Library, which it launched last year, and an API, or backend interface used by developers.

  • Both tools offer researchers access to huge swaths of data from publicly accessible content across Facebook and Instagram.
  • The tools are available in 180 languages and offer global data.
  • Researchers must apply for access to those tools through the Inter-university Consortium for Political and Social Research at the University of Michigan, which will vet their requests…(More)”

A typology of artificial intelligence data work


Article by James Muldoon et al: “This article provides a new typology for understanding human labour integrated into the production of artificial intelligence systems through data preparation and model evaluation. We call these forms of labour ‘AI data work’ and show how they are an important and necessary element of the artificial intelligence production process. We draw on fieldwork with an artificial intelligence data business process outsourcing centre specialising in computer vision data, alongside a decade of fieldwork with microwork platforms, business process outsourcing, and artificial intelligence companies to help dispel confusion around the multiple concepts and frames that encompass artificial intelligence data work including ‘ghost work’, ‘microwork’, ‘crowdwork’ and ‘cloudwork’. We argue that these different frames of reference obscure important differences between how this labour is organised in different contexts. The article provides a conceptual division between the different types of artificial intelligence data work institutions and the different stages of what we call the artificial intelligence data pipeline. This article thus contributes to our understanding of how the practices of workers become a valuable commodity integrated into global artificial intelligence production networks…(More)”.

Riders in the smog


Article by Zuha Siddiqui, Samriddhi Sakuna and Faisal Mahmud: “…To better understand air quality exposure among gig workers in South Asia, Rest of World gave three gig workers — one each in Lahore, New Delhi, and Dhaka — air quality monitors to wear throughout a regular shift in January. The Atmotube Pro monitors continually tracked their exposure to carcinogenic pollutants — specifically PM1, PM2.5, and PM10 (different sizes of particulate matter), and volatile organic compounds such as benzene and formaldehyde.

The data revealed that all three workers were routinely exposed to hazardous levels of pollutants. For PM2.5, referring to particulates that are 2.5 micrometers in diameter or less — which have been linked to health risks including heart attacks and strokes — all riders were consistently logging exposure levels more than 10 times the World Health Organization’s recommended daily average of 15 micrograms per cubic meter. Manu Sharma, in New Delhi, recorded the highest PM2.5 level of the three riders, hitting 468.3 micrograms per cubic meter around 6 p.m. Lahore was a close second, with Iqbal recording 464.2 micrograms per cubic meter around the same time.
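As a quick sanity check on those figures, the snippet below compares the reported peaks against the WHO's 15 microgram per cubic meter daily guideline mentioned above. The numbers come straight from the article; the code is only illustrative arithmetic.

```python
WHO_PM25_DAILY_GUIDELINE = 15.0  # micrograms per cubic meter, recommended daily average

# Peak PM2.5 readings reported in the article (micrograms per cubic meter).
peak_readings = {
    "New Delhi (Manu Sharma)": 468.3,
    "Lahore (Iqbal)": 464.2,
}

for rider, reading in peak_readings.items():
    ratio = reading / WHO_PM25_DAILY_GUIDELINE
    print(f"{rider}: {reading} µg/m³, about {ratio:.0f} times the WHO daily guideline")
```

Both peaks work out to roughly 31 times the guideline, well above the 10-fold exposures the riders were logging on a routine basis.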

Alongside tracking specific pollutants, the Atmotube Pro gives an overall real-time air quality score (AQS) from 0–100, with zero being the most severely polluted, and 100 being the cleanest. According to Atmo, the company that makes the Atmotube monitors, a reading of 0–20 should be considered a health alert, under which conditions “everyone should avoid all outdoor exertion.” But the three gig workers found their monitors consistently displayed the lowest possible score…(More)”.