Feeding the Machine: The Hidden Human Labor Powering A.I.


Book by Mark Graham, Callum Cant, and James Muldoon: “Silicon Valley has sold us the illusion that artificial intelligence is a frictionless technology that will bring wealth and prosperity to humanity. But hidden beneath this smooth surface lies the grim reality of a precarious global workforce of millions laboring under often appalling conditions to make A.I. possible. This book presents an urgent, riveting investigation of the intricate network that maintains this exploitative system, revealing the untold truth of A.I.

Based on hundreds of interviews and thousands of hours of fieldwork over more than a decade, Feeding the Machine describes the lives of the workers deliberately concealed from view, and the power structures that determine their future. It gives voice to the people whom A.I. exploits, from accomplished writers and artists to the armies of data annotators, content moderators and warehouse workers, revealing how their dangerous, low-paid labor is connected to longer histories of gendered, racialized, and colonial exploitation.

A.I. is an extraction machine that feeds off humanity’s collective effort and intelligence, churning through ever-larger datasets to power its algorithms. This book is a call to arms that details what we need to do to fight for a more just digital future…(More)”.

AI firms will soon exhaust most of the internet’s data


Article by The Economist: “One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the ordering in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.

Whatever models are used, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”…

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives…(More)”.

Rethinking Dual-Use Technology


Article by Artur Kluz and Stefaan Verhulst: “A new concept of “triple use” — where technology serves commercial, defense, and peacebuilding purposes — may offer a breakthrough solution for founders, investors and society to explore….

As a result of the resurgence of geopolitical tensions, the debate about the applications of dual-use technology is intensifying. The core issue founders, tech entrepreneurs, venture capitalists (VCs), and limited partner investors (LPs) are examining is whether commercial technologies should increasingly be re-used for military purposes. Traditionally, the majority of investors (including limited partners) have prohibited dual-use tech in their agreements. However, the rapidly growing dual-use market, with its substantial addressable size and growth potential, is compelling all stakeholders to reconsider this stance. The pressure for innovations, capital returns and Return On Investment (ROI) is driving the need for a solution.

These discussions are fraught with moral complexity, but they also present an opportunity to rethink the dual-use paradigm and foster investment in technologies aimed at supporting peace. A new concept of “triple use”— where technology serves commercial, defense, and peacebuilding purposes — may offer an innovative and more positive avenue for founders, investors and society to explore. This additional re-use, which remains in an incipient state, is increasingly being referred to as PeaceTech. By integrating terms dedicated to PeaceTech in new and existing investment and LP agreements, tech companies, founders and venture capital investors can be also required to apply their technology for peacebuilding purposes. This approach can expand the applications of emerging technologies to also include conflict prevention, reconstruction or any humanitarian aspects.

However, current efforts to use technologies for peacebuilding are impeded by various obstacles, including a lack of awareness within the tech sector and among investors, limited commercial interest, disparities in technical capacity, privacy concerns, international relations and political complexities. In the below we examine some of these challenges, while also exploring certain avenues for overcoming them — including approaching technologies for peace as a “triple use” application. We will especially try to identify examples of how tech companies, tech entrepreneurs, accelerators, and tech investors including VCs and LPs can commercially benefit and support “triple use” technologies. Ultimately, we argue, the vast potential — largely untapped — of “triple use” technologies calls for a new wave of tech ecosystem transformation and public and private investments as well as the development of a new field of research…(More)”.

The Risks of Empowering “Citizen Data Scientists”


Article by Reid Blackman and Tamara Sipes: “Until recently, the prevailing understanding of artificial intelligence (AI) and its subset machine learning (ML) was that expert data scientists and AI engineers were the only people that could push AI strategy and implementation forward. That was a reasonable view. After all, data science generally, and AI in particular, is a technical field requiring, among other things, expertise that requires many years of education and training to obtain.

Fast forward to today, however, and the conventional wisdom is rapidly changing. The advent of “auto-ML” — software that provides methods and processes for creating machine learning code — has led to calls to “democratize” data science and AI. The idea is that these tools enable organizations to invite and leverage non-data scientists — say, domain data experts, team members very familiar with the business processes, or heads of various business units — to propel their AI efforts.

In theory, making data science and AI more accessible to non-data scientists (including technologists who are not data scientists) can make a lot of business sense. Centralized and siloed data science units can fail to appreciate the vast array of data the organization has and the business problems that it can solve, particularly with multinational organizations with hundreds or thousands of business units distributed across several continents. Moreover, those in the weeds of business units know the data they have, the problems they’re trying to solve, and can, with training, see how that data can be leveraged to solve those problems. The opportunities are significant.

In short, with great business insight, augmented with auto-ML, can come great analytic responsibility. At the same time, we cannot forget that data science and AI are, in fact, very difficult, and there’s a very long journey from having data to solving a problem. In this article, we’ll lay out the pros and cons of integrating citizen data scientists into your AI strategy and suggest methods for optimizing success and minimizing risks…(More)”.

Anonymization: The imperfect science of using data while preserving privacy


Paper by Andrea Gadotti et al: “Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today…(More)”.
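One of the modern approaches named in the abstract, differential privacy, can be illustrated with a small sketch. The example below is not taken from the paper: it is a minimal Laplace mechanism for a counting query, and the function names (`laplace_noise`, `dp_count`) and the sample data are hypothetical. It assumes that adding or removing one record changes the count by at most 1, so noise with scale 1/epsilon is added.

```python
import random


def laplace_noise(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution.

    The difference of two independent exponential variables with
    mean `scale` follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng=None):
    """Return a noisy count of the records that satisfy `predicate`.

    Assumption: a counting query has sensitivity 1, so the noise
    scale is 1 / epsilon. A smaller epsilon means more noise.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    rng = rng or random.Random()
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


if __name__ == "__main__":
    ages = [23, 35, 41, 58, 62, 19, 44]  # hypothetical data
    print(dp_count(ages, lambda age: age >= 40, epsilon=1.0))
```

Each call returns a different value centered on the true count. In line with the paper's emphasis on auditing guarantees, a real deployment would also need to track the total privacy budget spent across repeated queries, which this sketch does not do.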

The Tech Coup


Book by Marietje Schaake: “Over the past decades, under the cover of “innovation,” technology companies have successfully resisted regulation and have even begun to seize power from governments themselves. Facial recognition firms track citizens for police surveillance. Cryptocurrency has wiped out the personal savings of millions and threatens the stability of the global financial system. Spyware companies sell digital intelligence tools to anyone who can afford them. This new reality—where unregulated technology has become a forceful instrument for autocrats around the world—is terrible news for democracies and citizens.

In The Tech Coup, Marietje Schaake offers a behind-the-scenes account of how technology companies crept into nearly every corner of our lives and our governments. She takes us beyond the headlines to high-stakes meetings with human rights defenders, business leaders, computer scientists, and politicians to show how technologies—from social media to artificial intelligence—have gone from being heralded as utopian to undermining the pillars of our democracies. To reverse this existential power imbalance, Schaake outlines game-changing solutions to empower elected officials and citizens alike. Democratic leaders can—and must—resist the influence of corporate lobbying and reinvent themselves as dynamic, flexible guardians of our digital world.

Drawing on her experiences in the halls of the European Parliament and among Silicon Valley insiders, Schaake offers a frightening look at our modern tech-obsessed world—and a clear-eyed view of how democracies can build a better future before it is too late…(More)”.

The impact of data portability on user empowerment, innovation, and competition


OECD Note: “Data portability enhances access to and sharing of data across digital services and platforms. It can empower users to play a more active role in the re-use of their data and can help stimulate competition and innovation by fostering interoperability while reducing switching costs and lock-in effects. However, the effectiveness of data portability in enhancing competition depends on the terms and conditions of data transfer and the extent to which competitors can make use of the data effectively. Additionally, there are potential downsides: data portability measures may unintentionally stifle competition in fast-evolving markets where interoperability requirements may disproportionately burden SMEs and start-ups. Data portability can also increase digital security and privacy risks by enabling data transfers to multiple destinations. This note presents the following five dimensions essential for designing and implementing data portability frameworks: sectoral scope; beneficiaries; type of data; legal obligations; and operational modality…(More)”.

Community consent: neither a ceiling nor a floor


Article by Jasmine McNealy: “The 23andMe breach and the Golden State Killer case are two of the more “flashy” cases, but questions of consent, especially the consent of all of those affected by biodata collection and analysis in more mundane or routine health and medical research projects, are just as important. The communities of people affected have expectations about their privacy and the possible impacts of inferences that could be made about them in data processing systems. Researchers must, then, acquire community consent when attempting to work with networked biodata. 

Several benefits of community consent exist, especially for marginalized and vulnerable populations. These benefits include:

  • Ensuring that information about the research project spreads throughout the community,
  • Removing potential barriers that might be created by resistance from community members,
  • Alleviating the possible concerns of individuals about the perspectives of community leaders, and 
  • Allowing the recruitment of participants using methods most salient to the community.

But community consent does not replace individual consent and limits exist for both community and individual consent. Therefore, within the context of a biorepository, understanding whether community consent might be a ceiling or a floor requires examining governance and autonomy…(More)”.

The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
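To make the robots.txt mechanism concrete, here is a minimal sketch that uses Python's standard `urllib.robotparser` module. The crawler name `ExampleAIBot`, the sample rules, and the helper `is_allowed` are assumptions made up for illustration; they do not come from the study or from any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named crawler, allow everyone else.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt, user_agent, url):
    """Report whether `user_agent` may fetch `url` under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/article"
    print(is_allowed(ROBOTS_TXT, "ExampleAIBot", url))  # blocked crawler
    print(is_allowed(ROBOTS_TXT, "OtherBot", url))      # falls under "*"
```

The protocol is advisory: the file only states the site owner's wishes, and it is up to each crawler to check it and comply.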

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.

Governance of deliberative mini-publics: emerging consensus and divergent views


Paper by Lucy J. Parry, Nicole Curato, and John S. Dryzek: “Deliberative mini-publics are forums for citizen deliberation composed of randomly selected citizens convened to yield policy recommendations. These forums have proliferated in recent years but there are no generally accepted standards to govern their practice. Should there be? We answer this question by bringing the scholarly literature on citizen deliberation into dialogue with the lived experience of the people who study, design and implement mini-publics. We use Q methodology to locate five distinct perspectives on the integrity of mini-publics, and map the structure of agreement and dispute across them. We find that, across the five viewpoints, there is emerging consensus as well as divergence on integrity issues, with disagreement over what might be gained or lost by adapting common standards of practice, and possible sources of integrity risks. This article provides an empirical foundation for further discussion on integrity standards in the future…(More)”.