Essential requirements for the governance and management of data trusts, data repositories, and other data collaborations


Paper by Alison Paprica et al: “Around the world, many organisations are working on ways to increase the use, sharing, and reuse of person-level data for research, evaluation, planning, and innovation while ensuring that data are secure and privacy is protected. As a contribution to broader efforts to improve data governance and management, in 2020 members of our team published 12 minimum specification essential requirements (min specs) to provide practical guidance for organisations establishing or operating data trusts and other forms of data infrastructure… We convened an international team, consisting mostly of participants from Canada and the United States of America, to test and refine the original 12 min specs. Twenty-three data-focused organisations and initiatives recorded the various ways they address the min specs. Sub-teams analysed the results, used the findings to make improvements to the min specs, and identified materials to support organisations/initiatives in addressing the min specs.
Analyses and discussion led to an updated set of 15 min specs covering five categories: one min spec for Legal, five for Governance, four for Management, two for Data Users, and three for Stakeholder & Public Engagement. Multiple changes were made to make the min specs language more technically complete and precise. The updated set of 15 min specs has been integrated into a Canadian national standard that, to our knowledge, is the first to include requirements for public engagement and Indigenous Data Sovereignty…(More)”.

What Big Tech Knows About Your Body


Article by Yael Grauer: “If you were seeking online therapy from 2017 to 2021—and a lot of people were—chances are good that you found your way to BetterHelp, which today describes itself as the world’s largest online-therapy purveyor, with more than 2 million users. Once you were there, after a few clicks, you would have completed a form—an intake questionnaire, not unlike the paper one you’d fill out at any therapist’s office: Are you new to therapy? Are you taking any medications? Having problems with intimacy? Experiencing overwhelming sadness? Thinking of hurting yourself? BetterHelp would have asked you if you were religious, if you were LGBTQ, if you were a teenager. These questions were just meant to match you with the best counselor for your needs, small text would have assured you. Your information would remain private.

Except BetterHelp isn’t exactly a therapist’s office, and your information may not have been completely private. In fact, according to a complaint brought by federal regulators, for years, BetterHelp was sharing user data—including email addresses, IP addresses, and questionnaire answers—with third parties, including Facebook and Snapchat, for the purposes of targeting ads for its services. It was also, according to the Federal Trade Commission, poorly regulating what those third parties did with users’ data once they got them. In July, the company finalized a settlement with the FTC and agreed to refund $7.8 million to consumers whose privacy, regulators claimed, had been compromised. (In a statement, BetterHelp admitted no wrongdoing and described the alleged sharing of user information as an “industry-standard practice.”)

We leave digital traces about our health everywhere we go: by completing forms like BetterHelp’s. By requesting a prescription refill online. By clicking on a link. By asking a search engine about dosages or directions to a clinic or pain in chest dying. By shopping, online or off. By participating in consumer genetic testing. By stepping on a smart scale or using a smart thermometer. By joining a Facebook group or a Discord server for people with a certain medical condition. By using internet-connected exercise equipment. By using an app or a service to count your steps or track your menstrual cycle or log your workouts. Even demographic and financial data unrelated to health can be aggregated and analyzed to reveal or infer sensitive information about people’s physical or mental-health conditions…(More)”.

The Man Who Trapped Us in Databases


McKenzie Funk in The New York Times Magazine: “One of Asher’s innovations — or more precisely one of his companies’ innovations — was what is now known as the LexID. My LexID, I learned, is 000874529875. This unique string of digits is a kind of shadow Social Security number, one of many such “persistent identifiers,” as they are called, that have been issued not by the government but by data companies like Acxiom, Oracle, Thomson Reuters, TransUnion — or, in this case, LexisNexis.

My LexID was created sometime in the early 2000s in Asher’s computer room in South Florida, as many still are, and without my consent it began quietly stalking me. One early data point on me would have been my name; another, my parents’ address in Oregon. From my birth certificate or my driver’s license or my teenage fishing license — and from the fact that the three confirmed one another — it could get my sex and my date of birth. At the time, it would have been able to collect the address of the college I attended, Swarthmore, which was small and expensive, and it would have found my first full-time employer, the National Geographic Society, quickly amassing more than enough data to let someone — back then, a human someone — infer quite a bit more about me and my future prospects…(More)”

Data Repurposing through Compatibility: A Computational Perspective


Paper by Asia Biega: “Reuse of data in new contexts beyond the purposes for which it was originally collected has contributed to technological innovation and reducing the consent burden on data subjects. One of the legal mechanisms that makes such reuse possible is purpose compatibility assessment. In this paper, I offer an in-depth analysis of this mechanism through a computational lens. I moreover consider what should qualify as repurposing apart from using data for a completely new task, and argue that typical purpose formulations are an impediment to meaningful repurposing. Overall, the paper positions compatibility assessment as a constructive practice beyond an ineffective standard…(More)”

Machine-assisted mixed methods: augmenting humanities and social sciences with artificial intelligence


Paper by Andres Karjus: “The increasing capacities of large language models (LLMs) present an unprecedented opportunity to scale up data analytics in the humanities and social sciences, augmenting and automating qualitative analytic tasks previously typically allocated to human labor. This contribution proposes a systematic mixed methods framework to harness qualitative analytic expertise, machine scalability, and rigorous quantification, with attention to transparency and replicability. Sixteen machine-assisted case studies are showcased as proof of concept. Tasks include linguistic and discourse analysis, lexical semantic change detection, interview analysis, historical event cause inference and text mining, detection of political stance, text and idea reuse, genre composition in literature and film; social network inference, automated lexicography, missing metadata augmentation, and multimodal visual cultural analytics. In contrast to the focus on English in the emerging LLM applicability literature, many examples here deal with scenarios involving smaller languages and historical texts prone to digitization distortions. In all but the most difficult tasks requiring expert knowledge, generative LLMs can demonstrably serve as viable research instruments. LLM (and human) annotations may contain errors and variation, but the agreement rate can and should be accounted for in subsequent statistical modeling; a bootstrapping approach is discussed. The replications among the case studies illustrate how tasks previously requiring potentially months of team effort and complex computational pipelines can now be accomplished by an LLM-assisted scholar in a fraction of the time. Importantly, this approach is not intended to replace, but to augment researcher knowledge and skills. With these opportunities in sight, qualitative expertise and the ability to pose insightful questions have arguably never been more critical…(More)”.
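The abstract’s point that annotation error “can and should be accounted for in subsequent statistical modeling” via bootstrapping can be illustrated with a minimal sketch. The data, function name, and interval procedure below are illustrative assumptions, not the paper’s actual pipeline: resample the (possibly noisy) LLM labels with replacement, recompute the statistic of interest, and report the resulting uncertainty interval rather than a bare point estimate.

```python
import random

def bootstrap_proportion(labels, n_boot=2000, seed=42):
    """Percentile-bootstrap a 95% interval for the proportion of
    positive labels in a set of (possibly noisy) LLM annotations."""
    rng = random.Random(seed)
    n = len(labels)
    stats = []
    for _ in range(n_boot):
        # Resample the annotations with replacement and recompute the statistic
        resample = [labels[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int(0.025 * n_boot)]
    hi = stats[int(0.975 * n_boot)]
    return sum(labels) / n, (lo, hi)

# Hypothetical annotations: 1 = stance detected by the LLM, 0 = not detected
annotations = [1] * 70 + [0] * 30
point, (lo, hi) = bootstrap_proportion(annotations)
print(f"estimate={point:.2f}, 95% interval=({lo:.2f}, {hi:.2f})")
```

Downstream claims (e.g. “stance X appears in roughly 70% of the corpus”) can then carry the bootstrap interval, so that annotation noise is visible in the reported result.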

Get a rabbit: Don’t trust the numbers


Article by John Lanchester: “At a dinner with the American ambassador in 2007, Li Keqiang, future premier of China, said that when he wanted to know what was happening to the country’s economy, he looked at the numbers for electricity use, rail cargo and bank lending. There was no point using the official GDP statistics, Li said, because they are ‘man-made’. That remark, which we know about thanks to WikiLeaks, is fascinating for two reasons. First, because it shows a sly, subtle, worldly humour – a rare glimpse of the sort of thing Chinese Communist Party leaders say in private. Second, because it’s true. A whole strand in contemporary thinking about the production of knowledge is summed up there: data and statistics, all of them, are man-made.

They are also central to modern politics and governance, and the ways we talk about them. That in itself represents a shift. Discussions that were once about values and beliefs – about what a society wants to see when it looks at itself in the mirror – have increasingly turned to arguments about numbers, data, statistics. It is a quirk of history that the politician who introduced this style of debate wasn’t Harold Wilson, the only prime minister to have had extensive training in statistics, but Margaret Thatcher, who thought in terms of values but argued in terms of numbers. Even debates that are ultimately about national identity, such as the referendums about Scottish independence and EU membership, now turn on numbers.

Given the ubiquity of this style of argument, we are nowhere near as attentive to its misuses as we should be. As the House of Commons Treasury Committee said dryly in a 2016 report on the economic debate about EU membership, ‘many of these claims sound factual because they use numbers.’ The best short book about the use and misuse of statistics is Darrell Huff’s How to Lie with Statistics, first published in 1954, a devil’s-advocate guide to the multiple ways in which numbers are misused in advertising, commerce and politics. (Single best tip: ‘up to’ is always a fib. It means somebody did a range of tests and has artfully chosen the most flattering number.) For all its virtues, though, even Huff’s book doesn’t encompass the full range of possibilities for statistical deception. In politics, the numbers in question aren’t just man-made but are often contentious, tendentious or outright fake.

Two fake numbers have been decisively influential in British politics over the last baleful thirteen years. The first was an outright lie: Vote Leave’s assertion that £350 million a week extra ‘for the NHS’ would be available if we left the EU. The real number for the UK’s net contribution to the EU was £110 million, but that didn’t matter, since the crucial thing for the Leave campaign was to make the number the focus of debate. The Treasury Committee said the number was fake, and so did the UK Statistics Authority. This had no, or perhaps even a negative, effect. In politics it doesn’t really matter what the numbers are, so much as whose they are. If people are arguing about your numbers, you’re winning…(More)”.

On the culture of open access: the Sci-hub paradox


Paper by Abdelghani Maddi and David Sapinho: “Shadow libraries, also known as ”pirate libraries”, are online collections of copyrighted publications that have been made available for free without the permission of the copyright holders. They have gradually become key players in the dissemination of scientific knowledge, despite their illegality in most countries of the world. Many publishers and scientist-editors decry such libraries for their copyright infringement and the loss of publication usage information, while some scholars and institutions support them, sometimes in a roundabout way, for their role in reducing inequalities of access to knowledge, particularly in low-income countries. Although there is a wealth of literature on shadow libraries, none of it has focused on their potential role in knowledge dissemination through the open access movement. Here we analyze how shadow libraries can affect researchers’ citation practices, highlighting some counter-intuitive findings about their impact on the Open Access Citation Advantage (OACA). Based on a large randomized sample, this study first shows that OA publications, including those in fully OA journals, receive more citations than their subscription-based counterparts do. However, the OACA has slightly decreased over the last seven years. Distinguishing, among subscription-based publications, those that are and are not accessible via the Sci-Hub platform suggests that the generalization of its use cancels out the positive effect of OA publishing. The results show that publications in fully OA journals are victims of the success of Sci-Hub. Thus, paradoxically, although Sci-Hub may seem to facilitate access to scientific knowledge, it negatively affects the OA movement as a whole by reducing the comparative advantage of OA publications in terms of visibility for researchers.
The democratization of the use of Sci-Hub may therefore lead to a vicious cycle, hindering efforts to develop full OA strategies without proposing a credible and sustainable alternative model for the dissemination of scientific knowledge…(More)”.

Game Changing Tools for Evidence Synthesis: Generative AI, Data and Policy Design


Paper by Geoff Mulgan and Sarah O’Meara: “Evidence synthesis aims to make sense of huge bodies of evidence from around the world and make it available for busy decision-makers. Google search was in some respects a game changer in that you could quickly find out what was happening in a field – but it turned out to be much less useful for judging which evidence was relevant, reliable or high quality. Now large language models (LLMs) and generative AI are offering an alternative to Google and again appear to have the potential to dramatically improve evidence synthesis, in an instant bringing together large bodies of knowledge and making it available to policy-makers, members of parliament or indeed the public.

But again there’s a gap between the promise and the results. ChatGPT is wonderful for producing a rough first draft: but its inputs are often out of date, it can’t distinguish good from bad evidence and its outputs are sometimes made up.  So nearly a year after the arrival of ChatGPT we have been investigating how generative AI can be used most effectively, and, linked to that, how new methods can embed evidence into the daily work of governments and provide ways to see if the best available evidence is being used.

We think that these will be game-changers: transforming the everyday life of policy-makers, and making it much easier to mobilise and assess evidence – especially if human and machine intelligence are combined rather than being seen as alternatives. But they need to be used with care and judgement rather than being treated as panaceas…(More)”.

Missing Persons: The Case of National AI Strategies


Article by Susan Ariel Aaronson and Adam Zable: “Policy makers should inform, consult and involve citizens as part of their efforts to govern data-driven technologies such as artificial intelligence (AI). Although many users rely on AI systems, they do not understand how these systems use their data to make predictions and recommendations that can affect their daily lives. Over time, if they see their data being misused, users may learn to distrust both the systems and how policy makers regulate them. This paper examines whether officials informed and consulted their citizens as they developed a key aspect of AI policy — national AI strategies. Building on a data set of 68 countries and the European Union, the authors used qualitative methods to examine whether, how and when governments engaged with their citizens on their AI strategies and whether they were responsive to public comment, concluding that policy makers are missing an opportunity to build trust in AI by not using this process to involve a broader cross-section of their constituents…(More)”.

These Prisoners Are Training AI


Article by Morgan Meaker: “…Around the world, millions of so-called “clickworkers” train artificial intelligence models, teaching machines the difference between pedestrians and palm trees, or what combination of words describe violence or sexual abuse. Usually these workers are stationed in the global south, where wages are cheap. OpenAI, for example, uses an outsourcing firm that employs clickworkers in Kenya, Uganda, and India. That arrangement works for American companies, operating in the world’s most widely spoken language, English. But there are not a lot of people in the global south who speak Finnish.

That’s why Metroc turned to prison labor. The company gets cheap, Finnish-speaking workers, while the prison system can offer inmates employment that, it says, prepares them for the digital world of work after their release. Using prisoners to train AI creates uneasy parallels with the kind of low-paid and sometimes exploitive labor that has often existed downstream in technology. But in Finland, the project has received widespread support.

“There’s this global idea of what data labor is. And then there’s what happens in Finland, which is very different if you look at it closely,” says Tuukka Lehtiniemi, a researcher at the University of Helsinki, who has been studying data labor in Finnish prisons.

For four months, Marmalade has lived here, in Hämeenlinna prison. The building is modern, with big windows. Colorful artwork tries to enforce a sense of cheeriness on otherwise empty corridors. If it wasn’t for the heavy gray security doors blocking every entry and exit, these rooms could easily belong to a particularly soulless school or university complex.

Finland might be famous for its open prisons—where inmates can work or study in nearby towns—but this is not one of them. Instead, Hämeenlinna is the country’s highest-security institution housing exclusively female inmates. Marmalade has been sentenced to six years. Under privacy rules set by the prison, WIRED is not able to publish Marmalade’s real name, exact age, or any other information that could be used to identify her. But in a country where prisoners serving life terms can apply to be released after 12 years, six years is a heavy sentence. And like the other 100 inmates who live here, she is not allowed to leave…(More)”.