Paper by Andres Karjus: “The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously typically allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. 16 machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film; social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks previously requiring potentially months of team effort and complex computational pipelines, can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is not intended to replace, but to augment researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical…(More)”.
Get a rabbit: Don’t trust the numbers ·
Article by John Lanchester: “At a dinner with the American ambassador in 2007, Li Keqiang, future premier of China, said that when he wanted to know what was happening to the country’s economy, he looked at the numbers for electricity use, rail cargo and bank lending. There was no point using the official GDP statistics, Li said, because they are ‘man-made’. That remark, which we know about thanks to WikiLeaks, is fascinating for two reasons. First, because it shows a sly, subtle, worldly humour – a rare glimpse of the sort of thing Chinese Communist Party leaders say in private. Second, because it’s true. A whole strand in contemporary thinking about the production of knowledge is summed up there: data and statistics, all of them, are man-made.
They are also central to modern politics and governance, and the ways we talk about them. That in itself represents a shift. Discussions that were once about values and beliefs – about what a society wants to see when it looks at itself in the mirror – have increasingly turned to arguments about numbers, data, statistics. It is a quirk of history that the politician who introduced this style of debate wasn’t Harold Wilson, the only prime minister to have had extensive training in statistics, but Margaret Thatcher, who thought in terms of values but argued in terms of numbers. Even debates that are ultimately about national identity, such as the referendums about Scottish independence and EU membership, now turn on numbers.
Given the ubiquity of this style of argument, we are nowhere near as attentive to its misuses as we should be. As the House of Commons Treasury Committee said dryly in a 2016 report on the economic debate about EU membership, ‘many of these claims sound factual because they use numbers.’ The best short book about the use and misuse of statistics is Darrell Huff’s How to Lie with Statistics, first published in 1954, a devil’s-advocate guide to the multiple ways in which numbers are misused in advertising, commerce and politics. (Single best tip: ‘up to’ is always a fib. It means somebody did a range of tests and has artfully chosen the most flattering number.) For all its virtues, though, even Huff’s book doesn’t encompass the full range of possibilities for statistical deception. In politics, the numbers in question aren’t just man-made but are often contentious, tendentious or outright fake.
Two fake numbers have been decisively influential in British politics over the baleful last thirteen years. The first was an outright lie: Vote Leave’s assertion that £350 million a week extra ‘for the NHS’ would be available if we left the EU. The real number for the UK’s net contribution to the EU was £110 million, but that didn’t matter, since the crucial thing for the Leave campaign was to make the number the focus of debate. The Treasury Committee said the number was fake, and so did the UK Statistics Authority. This had no, or perhaps even a negative, effect. In politics it doesn’t really matter what the numbers are, so much as whose they are. If people are arguing about your numbers, you’re winning…(More)“.
On the culture of open access: the Sci-hub paradox
Paper by Abdelghani Maddi and David Sapinho: “Shadow libraries, also known as ”pirate libraries”, are online collections of copyrighted publications that have been made available for free without the permission of the copyright holders. They have gradually become key players of scientific knowledge dissemination, despite their illegality in most countries of the world. Many publishers and scientist-editors decry such libraries for their copyright infringement and loss of publication usage information, while some scholars and institutions support them, sometimes in a roundabout way, for their role in reducing inequalities of access to knowledge, particularly in low-income countries. Although there is a wealth of literature on shadow libraries, none of this have focused on its potential role in knowledge dissemination, through the open access movement. Here we analyze how shadow libraries can affect researchers’ citation practices, highlighting some counter-intuitive findings about their impact on the Open Access Citation Advantage (OACA). Based on a large randomized sample, this study first shows that OA publications, including those in fully OA journals, receive more citations than their subscription-based counterparts do. However, the OACA has slightly decreased over the seven last years. The introduction of a distinction between those accessible or not via the Scihub platform among subscription-based suggest that the generalization of its use cancels the positive effect of OA publishing. The results show that publications in fully OA journals are victims of the success of Sci-hub. Thus, paradoxically, although Sci-hub may seem to facilitate access to scientific knowledge, it negatively affects the OA movement as a whole, by reducing the comparative advantage of OA publications in terms of visibility for researchers. The democratization of the use of Sci-hub may therefore lead to a vicious cycle, hindering efforts to develop full OA strategies without proposing a credible and sustainable alternative model for the dissemination of scientific knowledge…(More)”.
Game Changing Tools for Evidence Synthesis: Generative AI, Data and Policy Design
Paper by Geoff Mulgan, and Sarah O’Meara: “Evidence synthesis aims to make sense of huge bodies of evidence from around the world and make it available for busy decision-makers. Google search was in some respects a game changer in that you could quickly find out what was happening in a field – but it turned out to be much less useful for judging which evidence was relevant, reliable or high quality. Now large language models (LLM) and generative AI are offering an alternative to Google and again appear to have the potential to dramatically improve evidence synthesis, in an instant bringing together large bodies of knowledge and making it available to policy-makers, members of parliament or indeed the public.
But again there’s a gap between the promise and the results. ChatGPT is wonderful for producing a rough first draft: but its inputs are often out of date, it can’t distinguish good from bad evidence and its outputs are sometimes made up. So nearly a year after the arrival of ChatGPT we have been investigating how generative AI can be used most effectively, and, linked to that, how new methods can embed evidence into the daily work of governments and provide ways to see if the best available evidence is being used.
We think that these will be game-changers: transforming the everyday life of policy-makers, and making it much easier to mobilise, and assess evidence – especially if human and machine intelligence are combined rather than being seen as alternatives. But they need to be used with care and judgement rather than being panaceas. [Watch IPPO’s recent discussion here]…(More)”.
Missing Persons: The Case of National AI Strategies
Article by Susan Ariel Aaronson and Adam Zable: “Policy makers should inform, consult and involve citizens as part of their efforts to data-driven technologies such as artificial intelligence (AI). Although many users rely on AI systems, they do not understand how these systems use their data to make predictions and recommendations that can affect their daily lives. Over time, if they see their data being misused, users may learn to distrust both the systems and how policy makers regulate them. This paper examines whether officials informed and consulted their citizens as they developed a key aspect of AI policy — national AI strategies. Building on a data set of 68 countries and the European Union, the authors used qualitative methods to examine whether, how and when governments engaged with their citizens on their AI strategies and whether they were responsive to public comment, concluding that policy makers are missing an opportunity to build trust in AI by not using this process to involve a broader cross-section of their constituents…(More)”.
These Prisoners Are Training AI
Article by Morgan Meaker: “…Around the world, millions of so-called “clickworkers” train artificial intelligence models, teaching machines the difference between pedestrians and palm trees, or what combination of words describe violence or sexual abuse. Usually these workers are stationed in the global south, where wages are cheap. OpenAI, for example, uses an outsourcing firm that employs clickworkers in Kenya, Uganda, and India. That arrangement works for American companies, operating in the world’s most widely spoken language, English. But there are not a lot of people in the global south who speak Finnish.
That’s why Metroc turned to prison labor. The company gets cheap, Finnish-speaking workers, while the prison system can offer inmates employment that, it says, prepares them for the digital world of work after their release. Using prisoners to train AI creates uneasy parallels with the kind of low-paid and sometimes exploitive labor that has often existed downstream in technology. But in Finland, the project has received widespread support.
“There’s this global idea of what data labor is. And then there’s what happens in Finland, which is very different if you look at it closely,” says Tuukka Lehtiniemi, a researcher at the University of Helsinki, who has been studying data labor in Finnish prisons.
For four months, Marmalade has lived here, in Hämeenlinna prison. The building is modern, with big windows. Colorful artwork tries to enforce a sense of cheeriness on otherwise empty corridors. If it wasn’t for the heavy gray security doors blocking every entry and exit, these rooms could easily belong to a particularly soulless school or university complex.
Finland might be famous for its open prisons—where inmates can work or study in nearby towns—but this is not one of them. Instead, Hämeenlinna is the country’s highest-security institution housing exclusively female inmates. Marmalade has been sentenced to six years. Under privacy rules set by the prison, WIRED is not able to publish Marmalade’s real name, exact age, or any other information that could be used to identify her. But in a country where prisoners serving life terms can apply to be released after 12 years, six years is a heavy sentence. And like the other 100 inmates who live here, she is not allowed to leave…(More)”.
Open Science and Data Protection: Engaging Scientific and Legal Contexts
Editorial Paper of Special Issue edited by Ludovica Paseri: “This paper analyses the relationship between open science policies and data protection. In order to tackle the research data paradox of the contemporary science, i.e., the tension between the pursuit of data-driven scientific research and the crisis of repeatability or reproducibility of science, a theoretical perspective suggests a potential convergence between open science and data protection. Both fields regard governance mechanisms that shall take into account the plurality of interests at stake. The aim is to shed light on the processing of personal data for scientific research purposes in the context of open science. The investigation supports a threefold need: that of broadening the legal debate; of expanding the territorial scope of the analysis, in addition to the extra-territoriality effects of the European Union’s law; and an interdisciplinary discussion. Based on these needs, four perspectives are then identified, that encompass the challenges related to data processing in the context of open science: (i) the contextual and epistemological perspectives; (ii) the legal coordination perspectives; (iii) the governance perspectives; and (iv) the technical perspectives…(More)”.
Initial policy considerations for generative artificial intelligence
OECD Report: “Generative artificial intelligence (AI) creates new content in response to prompts, offering transformative potential across multiple sectors such as education, entertainment, healthcare and scientific research. However, these technologies also pose critical societal and policy challenges that policy makers must confront: potential shifts in labour markets, copyright uncertainties, and risk associated with the perpetuation of societal biases and the potential for misuse in the creation of disinformation and manipulated content. Consequences could extend to the spreading of mis- and disinformation, perpetuation of discrimination, distortion of public discourse and markets, and the incitement of violence. Governments recognise the transformative impact of generative AI and are actively working to address these challenges. This paper aims to inform these policy considerations and support decision makers in addressing them…(More)”.
More Companies Are Disclosing Their ESG Data, but Confusion on How Persists
Article by David Breg: “Public companies in the U.S. are increasingly disclosing sustainability information, but many say they find it a challenge to report fundamental climate data that many regulators around the globe likely will require under incoming mandatory reporting standards.
Nearly two-thirds of respondents said their company was disclosing environmental, social and governance information, up from 56% in the prior year, according to the annual survey of sustainability officials that WSJ Pro conducted this spring.
However, there was little consensus on which framework to use and respondents highlighted three fundamental types of information as their three biggest environmental reporting challenges: Greenhouse-gas emissions, climate-change risk and energy management.
The proportion of companies disclosing sustainability and ESG information was 63%, up from 56% last year. Those that don’t yet report this data but plan to was 16%, down from 25% last year. About one-fifth of respondents said their organization had no plans to report their progress, virtually unchanged from last year. Breaking that down, a quarter of private companies don’t plan any ESG reporting, while only 7% of public companies felt the same.
Regulators around the globe are finalizing rules that would require companies to publish standardized information after years of patchy voluntary ESG reporting based on a host of frameworks. California’s governor has said he would soon sign that state’s requirements into law. The U.S. Securities and Exchange Commission’s rules are expected later this year. European regulations are already in place and many other countries are also working on standards. The International Sustainability Standards Board hopes its climate framework, completed this past summer, becomes the global baseline.
While it is mostly public companies that face mandatory requirements, even private businesses face increased scrutiny of their sustainability and ESG policies from stakeholders including shareholders, eco-conscious consumers, suppliers, insurers and lenders…(More)”.
Surveys Provide Insight Into Three Factors That Encourage Open Data and Science
Article by Joshua Borycz, Alison Specht and Kevin Crowston: “Open Science is a game changer for researchers and the research community. The UNESCO Open Science recommendations in 2021 suggest that the practice of Open Science is a win-win for researchers as they gain from others’ work while making contributions, which in turn benefits the community, as transparency of conclusions and hence confidence in new knowledge improves.
Over a 10-year period Carol Tenopir of DataONE and her team conducted a global survey of scientists, managers and government workers involved in broad environmental science activities about their willingness to share data and their opinion of the resources available to do so (Tenopir et al., 2011, 2015, 2018, 2020). Comparing the responses over that time shows a general increase in the willingness to share data (and thus engage in open science).
A higher willingness to share data corresponded with a decrease in satisfaction with data sharing resources across nations.
The most surprising result was that a higher willingness to share data corresponded with a decrease in satisfaction with data sharing resources across nations (e.g., skills, tools, training) (Fig.1). That is, researchers who did not want to share data were satisfied with the available resources, and those that did want to share data were dissatisfied. Researchers appear to only discover that the tools are insufficient when they begin the hard work of engaging in open science practices. This indicates that a cultural shift in the attitudes of researchers needs to precede the development of support and tools for data management…(More)”.