When A.I. Fails the Language Test, Who Is Left Out of the Conversation?


Article by Sara Ruberg: “While the use of A.I. has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. A.I. experts worry that the language gap could exacerbate technological inequities, and that it could leave many regions and cultures behind.

A delay of access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a Ph.D. candidate at Stanford University’s Artificial Intelligence Laboratory, who is on the team that built and tested a Vietnamese language model against others.

The tests his team ran found that A.I. tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the A.I. model to learn from.

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because A.I. tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.

An analysis of top websites by W3Techs, a tech survey company, found that English makes up over 60 percent of the internet’s language data. While English is widely spoken globally, native English speakers make up about 5 percent of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

Academic institutions, grass-roots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.

Lelapa AI, a start-up based in Johannesburg, is one company leading such efforts on the African continent. It is developing multilingual A.I. products for people and businesses in Africa…(More)”.

Feeding the Machine: The Hidden Human Labor Powering A.I.


Book by Mark Graham, Callum Cant, and James Muldoon: “Silicon Valley has sold us the illusion that artificial intelligence is a frictionless technology that will bring wealth and prosperity to humanity. But hidden beneath this smooth surface lies the grim reality of a precarious global workforce of millions laboring under often appalling conditions to make A.I. possible. This book presents an urgent, riveting investigation of the intricate network that maintains this exploitative system, revealing the untold truth of A.I.

Based on hundreds of interviews and thousands of hours of fieldwork over more than a decade, Feeding the Machine describes the lives of the workers deliberately concealed from view, and the power structures that determine their future. It gives voice to the people whom A.I. exploits, from accomplished writers and artists to the armies of data annotators, content moderators and warehouse workers, revealing how their dangerous, low-paid labor is connected to longer histories of gendered, racialized, and colonial exploitation.

A.I. is an extraction machine that feeds off humanity’s collective effort and intelligence, churning through ever-larger datasets to power its algorithms. This book is a call to arms that details what we need to do to fight for a more just digital future…(More)”.

AI firms will soon exhaust most of the internet’s data


Article by The Economist: “One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the ordering in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.
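The ordering problem described above can be illustrated with a minimal sketch: interleaving topic buckets in a round-robin so no subject is lumped at the end of training. The corpus and bucket names below are invented for illustration; real training pipelines weight and schedule sources far more carefully.

```python
import itertools

def interleave_sources(sources):
    """Round-robin over topic buckets so no topic clusters at the end.

    `sources` maps a topic name to its list of training examples.
    """
    iters = [iter(examples) for examples in sources.values()]
    mixed = []
    for batch in itertools.zip_longest(*iters):
        # zip_longest pads exhausted buckets with None; drop the padding.
        mixed.extend(x for x in batch if x is not None)
    return mixed

# Hypothetical toy corpus: three topic buckets of unequal size.
corpus = {
    "maths": ["m1", "m2", "m3", "m4"],
    "code":  ["c1", "c2"],
    "prose": ["p1", "p2", "p3"],
}
schedule = interleave_sources(corpus)
print(schedule)  # maths, code and prose examples alternate instead of clustering
```

Compare this with simply concatenating the buckets, which would place every maths example at the end, the failure mode the article warns about.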

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.

Whatever models are used, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”…

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives…(More)”.

Rethinking Dual-Use Technology


Article by Artur Kluz and Stefaan Verhulst: “A new concept of “triple use” — where technology serves commercial, defense, and peacebuilding purposes — may offer a breakthrough solution for founders, investors and society to explore….

As a result of the resurgence of geopolitical tensions, the debate about the applications of dual-use technology is intensifying. The core issue founders, tech entrepreneurs, venture capitalists (VCs), and limited partner investors (LPs) are examining is whether commercial technologies should increasingly be re-used for military purposes. Traditionally, the majority of investors (including limited partners) have prohibited dual-use tech in their agreements. However, the rapidly growing dual-use market, with its substantial addressable size and growth potential, is compelling all stakeholders to reconsider this stance. The pressure for innovation, capital returns and return on investment (ROI) is driving the need for a solution.

These discussions are fraught with moral complexity, but they also present an opportunity to rethink the dual-use paradigm and foster investment in technologies aimed at supporting peace. A new concept of “triple use”— where technology serves commercial, defense, and peacebuilding purposes — may offer an innovative and more positive avenue for founders, investors and society to explore. This additional re-use, which remains in an incipient state, is increasingly being referred to as PeaceTech. By integrating terms dedicated to PeaceTech in new and existing investment and LP agreements, tech companies, founders and venture capital investors can also be required to apply their technology for peacebuilding purposes. This approach can expand the applications of emerging technologies to include conflict prevention, reconstruction and humanitarian efforts.

However, current efforts to use technologies for peacebuilding are impeded by various obstacles, including a lack of awareness within the tech sector and among investors, limited commercial interest, disparities in technical capacity, privacy concerns, international relations and political complexities. Below, we examine some of these challenges, while also exploring certain avenues for overcoming them — including approaching technologies for peace as a “triple use” application. We especially try to identify examples of how tech companies, tech entrepreneurs, accelerators, and tech investors including VCs and LPs can commercially benefit from and support “triple use” technologies. Ultimately, we argue, the vast and largely untapped potential of “triple use” technologies calls for a new wave of tech ecosystem transformation and public and private investments, as well as the development of a new field of research…(More)”.

The Risks of Empowering “Citizen Data Scientists”


Article by Reid Blackman and Tamara Sipes: “Until recently, the prevailing understanding of artificial intelligence (AI) and its subset machine learning (ML) was that expert data scientists and AI engineers were the only people who could push AI strategy and implementation forward. That was a reasonable view. After all, data science generally, and AI in particular, is a technical field requiring, among other things, expertise that takes many years of education and training to acquire.

Fast forward to today, however, and the conventional wisdom is rapidly changing. The advent of “auto-ML” — software that provides methods and processes for creating machine learning code — has led to calls to “democratize” data science and AI. The idea is that these tools enable organizations to invite and leverage non-data scientists — say, domain data experts, team members very familiar with the business processes, or heads of various business units — to propel their AI efforts.

In theory, making data science and AI more accessible to non-data scientists (including technologists who are not data scientists) can make a lot of business sense. Centralized and siloed data science units can fail to appreciate the vast array of data the organization has and the business problems that it can solve, particularly in multinational organizations with hundreds or thousands of business units distributed across several continents. Moreover, those in the weeds of business units know the data they have and the problems they’re trying to solve, and can, with training, see how that data can be leveraged to solve those problems. The opportunities are significant.

In short, with great business insight, augmented with auto-ML, can come great analytic responsibility. At the same time, we cannot forget that data science and AI are, in fact, very difficult, and there’s a very long journey from having data to solving a problem. In this article, we’ll lay out the pros and cons of integrating citizen data scientists into your AI strategy and suggest methods for optimizing success and minimizing risks…(More)”.

Anonymization: The imperfect science of using data while preserving privacy


Paper by Andrea Gadotti et al: “Information about us, our actions, and our preferences is created at scale through surveys or scientific studies or as a result of our interaction with digital devices such as smartphones and fitness trackers. The ability to safely share and analyze such data is key for scientific and societal progress. Anonymization is considered by scientists and policy-makers as one of the main ways to share data while minimizing privacy risks. In this review, we offer a pragmatic perspective on the modern literature on privacy attacks and anonymization techniques. We discuss traditional de-identification techniques and their strong limitations in the age of big data. We then turn our attention to modern approaches to share anonymous aggregate data, such as data query systems, synthetic data, and differential privacy. We find that, although no perfect solution exists, applying modern techniques while auditing their guarantees against attacks is the best approach to safely use and share data today…(More)”.
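One of the modern approaches the review surveys, differential privacy, can be sketched minimally: a counting query has sensitivity 1 (one person joining or leaving the data changes the count by at most 1), so adding Laplace noise with scale 1/ε to the true count yields ε-differential privacy. The function names and numbers below are illustrative, not drawn from the paper.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via inverse-CDF sampling."""
    u = rng.random() - 0.5
    return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

def private_count(true_count, epsilon, rng):
    """Release a count with epsilon-differential privacy.

    Counting queries have sensitivity 1, so Laplace noise with
    scale 1/epsilon suffices.
    """
    return true_count + laplace_noise(1.0 / epsilon, rng)

rng = random.Random(42)
true_count = 100
answers = [private_count(true_count, epsilon=1.0, rng=rng) for _ in range(2000)]
avg = sum(answers) / len(answers)
print(round(avg, 1))  # close to 100, though each individual answer is noisy
```

The averaging step also hints at the review’s caution: repeated queries leak information, which is why real deployments track a privacy budget across queries rather than answering indefinitely.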

Policy fit for the future: the Australian Government Futures primer


Primer by Will Hartigan and Arthur Horobin: “Futures is a systematic exploration of probable, possible and preferable future developments to inform present-day policy, strategy and decision-making. It uses multiple plausible scenarios of the future to anticipate and make sense of disruptive change. It is also known as strategic foresight...

This primer provides an overview of Futures methodologies and their practical application to policy development and advice. It is a first step for policy teams and officers interested in Futures: providing you with a range of flexible tools, ideas and advice you can adapt to your own policy challenges and environments.

This primer was developed by the Policy Projects and Taskforce Office in the Department of Prime Minister and Cabinet. We have drawn on expertise from inside and outside of government, including through our project partners, the Futures Hub at the National Security College at the Australian National University.

This primer has been written by policy officers, for policy officers, with a focus on practical and tested approaches that can support you to create policy fit for the future…(More)”.

Training LLMs to Draft Replies to Parliamentary Questions


Blog by Watson Chua: “In Singapore, the government is answerable to Parliament and Members of Parliament (MPs) may raise queries to any Minister on any matter in his portfolio. These questions can be answered orally during the Parliament sitting or through a written reply. Regardless of the medium, public servants in the ministries must gather materials to answer the question and prepare a response.

Generative AI and Large Language Models (LLMs) have already been applied to help public servants do this more effectively and efficiently. For example, Pair Search (publicly accessible) and the Hansard Analysis Tool (only accessible to public servants) help public servants search past Parliamentary Sittings for information relevant to the question and synthesise a response to it.

The existing systems draft the responses using prompt engineering and Retrieval Augmented Generation (RAG). To recap, RAG consists of two main parts:

  • Retriever: A search engine that finds documents relevant to the question
  • Generator: A text generation model (LLM) that takes in the instruction, the question, and the search results from the retriever to respond to the question
A typical RAG system (illustration by Hrishi Olickel).
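The two components can be sketched in miniature. A real system would use a proper search engine and an LLM; the keyword-overlap retriever, toy documents and prompt template below are purely illustrative.

```python
import re

def tokenize(text):
    """Lowercase and split into alphanumeric terms, dropping punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, k=2):
    """Retriever: rank documents by the number of terms shared with the question."""
    q_terms = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_terms & tokenize(d)), reverse=True)
    return ranked[:k]

def build_prompt(instruction, question, passages):
    """Generator input: instruction, retrieved context, then the question for the LLM."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"{instruction}\n\nContext:\n{context}\n\nQuestion: {question}\nAnswer:"

# Hypothetical document store standing in for past Parliamentary Sittings.
docs = [
    "The Minister addressed transport funding in the March sitting.",
    "Healthcare subsidies were expanded in 2023.",
    "Transport fare revisions were debated in Parliament.",
]
question = "What was said about transport funding?"
passages = retrieve(question, docs)
prompt = build_prompt("Draft a written reply using only the context.", question, passages)
print(prompt)
```

The prompt string is what would be sent to the generator LLM; only the retrieval and prompt-assembly steps are shown here.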

Using a pre-trained instruction-tuned LLM like GPT-4o, the generator can usually produce a good response. However, it might not be exactly what is desired in terms of verbosity, style and prose, and additional human post-processing might be needed. Extensive prompt engineering or few-shot learning can mold the response, at the expense of higher costs from the additional tokens in the prompt…(More)”

Automating public services


Report by Anna Dent: “…Public bodies, under financial stress and looking for effective solutions, are at risk of jumping on the automation bandwagon without critically assessing whether it’s actually appropriate for their needs, and whether the potential benefits outweigh the risks. To realise the benefits of automation and minimise problems for communities and public bodies themselves, a clear-eyed approach which really gets to grips with the risks is needed. 

The temptation to introduce automation to tackle complex social challenges is strong; they are often deep-rooted and expensive to deal with, and can have life-long implications for individuals and communities. But precisely because of their complex nature they are not the best fit for rules-based automated processes, which may fail to deliver what they set out to achieve. 

Bias is increasingly recognised as a critical challenge with automation in the public sector. Bias can be introduced through training data, and can occur when automated tools are disproportionately used on a particular community. In either case, the effectiveness of the tool or process is undermined, and citizens are at risk of discrimination, unfair targeting and exclusion from services. 

Automated tools and processes rely on huge amounts of data; in public services this will often mean personal information and data about us and our lives that we may or may not feel comfortable having used. Balancing everyone’s right to privacy with the desire for efficiency and better outcomes is rarely straightforward, and done badly it can lead to a breakdown in trust…(More)”.

The double-edged sword of AI in education


Article by Rose Luckin: “Artificial intelligence (AI) could revolutionize education as profoundly as the internet has already revolutionized our lives. However, our experience with commercial internet platforms gives us pause. Consider how social media algorithms, designed to maximize engagement and ad revenue, have inadvertently promoted divisive content and misinformation, a development at odds with educational goals.

Like the commercialization of the internet, the AI consumerization trend, driven by massive investments across sectors, prioritizes profit over societal and educational benefits. This focus on monetization risks overshadowing crucial considerations about AI’s integration into educational contexts.

The consumerization of AI in education is a double-edged sword. While increasing accessibility, it could also undermine fundamental educational principles and reshape students’ attitudes toward learning. We must advocate for a thoughtful, education-centric approach to AI development that enhances, rather than replaces, human intelligence and recognises the value of effort in learning.

As generative AI systems for education emerge, technical experts and policymakers have a unique opportunity to ensure their design supports the interests of learners and educators.

Risk 1: Overestimating AI’s intelligence

In essence, learning is not merely an individual cognitive process but a deeply social endeavor, intricately linked to cultural context, language development, and the dynamic relationship between practical experience and theoretical knowledge…(More)”.