When A.I. Fails the Language Test, Who Is Left Out of the Conversation?


Article by Sara Ruberg: “While the use of A.I. has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained on English-language data. A.I. experts worry that the language gap could exacerbate technological inequities, and that it could leave many regions and cultures behind.

A delay in access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a Ph.D. candidate at the Stanford Artificial Intelligence Laboratory who is on the team that built and tested a Vietnamese language model against others.

The tests his team ran found that A.I. tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the A.I. model to learn from.

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because A.I. tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.

An analysis of top websites by W3Techs, a tech survey company, found that English makes up over 60 percent of the internet’s language data. While English is widely spoken globally, native English speakers make up only about 5 percent of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

Academic institutions, grass-roots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.

Lelapa AI, based in Johannesburg, is one such company leading efforts on the African continent. The South African start-up is developing multilingual A.I. products for people and businesses in Africa…(More)”.

Feeding the Machine: The Hidden Human Labor Powering A.I.


Book by Mark Graham, Callum Cant, and James Muldoon: “Silicon Valley has sold us the illusion that artificial intelligence is a frictionless technology that will bring wealth and prosperity to humanity. But hidden beneath this smooth surface lies the grim reality of a precarious global workforce of millions laboring under often appalling conditions to make A.I. possible. This book presents an urgent, riveting investigation of the intricate network that maintains this exploitative system, revealing the untold truth of A.I.

Based on hundreds of interviews and thousands of hours of fieldwork over more than a decade, Feeding the Machine describes the lives of the workers deliberately concealed from view, and the power structures that determine their future. It gives voice to the people whom A.I. exploits, from accomplished writers and artists to the armies of data annotators, content moderators and warehouse workers, revealing how their dangerous, low-paid labor is connected to longer histories of gendered, racialized, and colonial exploitation.

A.I. is an extraction machine that feeds off humanity’s collective effort and intelligence, churning through ever-larger datasets to power its algorithms. This book is a call to arms that details what we need to do to fight for a more just digital future…(More)”.

AI firms will soon exhaust most of the internet’s data


Article by The Economist: “One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. What is more, the order in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised in maths but forget some other concepts. But setting the balance between data sources remains something of a dark art.

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest, given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.

Whatever models are used, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”…

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives…(More)”.

Training LLMs to Draft Replies to Parliamentary Questions


Blog by Watson Chua: “In Singapore, the government is answerable to Parliament and Members of Parliament (MPs) may raise queries to any Minister on any matter in his portfolio. These questions can be answered orally during the Parliament sitting or through a written reply. Regardless of the medium, public servants in the ministries must gather materials to answer the question and prepare a response.

Generative AI and Large Language Models (LLMs) have already been applied to help public servants do this more effectively and efficiently. For example, Pair Search (publicly accessible) and the Hansard Analysis Tool (accessible only to public servants) help public servants search past Parliamentary Sittings for information relevant to the question and synthesise a response to it.

The existing systems draft the responses using prompt engineering and Retrieval Augmented Generation (RAG). To recap, RAG consists of two main parts:

  • Retriever: A search engine that finds documents relevant to the question
  • Generator: A text generation model (LLM) that takes in the instruction, the question, and the search results from the retriever to respond to the question
A typical RAG system. Illustration by Hrishi Olickel.
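
The two parts can be sketched in a few lines of Python. This is a toy illustration only, not the actual Pair Search or Hansard Analysis implementation: the keyword-overlap retriever, the stubbed `generate()` function and the sample documents are all assumptions for demonstration (a real system would use a proper search index and call an LLM).

```python
def retrieve(question, documents, k=2):
    """Retriever: rank documents by word overlap with the question (toy scoring)."""
    q_words = set(question.lower().split())
    scored = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return scored[:k]

def generate(instruction, question, context):
    """Generator stub: a real system would send this assembled prompt to an LLM."""
    return f"{instruction}\nQuestion: {question}\nContext: {' | '.join(context)}"

docs = [
    "The ministry allocated funds for public housing in 2023.",
    "Parliament sat for 12 days in March.",
    "Healthcare subsidies were expanded last year.",
]
question = "What funds were allocated for housing?"
prompt = generate("Draft a reply using the context.", question,
                  retrieve(question, docs))
print(prompt)
```

The design point is the division of labour: the retriever narrows the corpus to a handful of relevant passages, so the generator only has to condition on a short, focused context rather than everything ever said in Parliament.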

Using a pre-trained, instruction-tuned LLM like GPT-4o, the generator can usually produce a good response. However, it might not be exactly what is desired in terms of verbosity, style and prose, and additional human post-processing might be needed. Extensive prompt engineering or few-shot learning can mold the response, at the expense of higher costs from the additional tokens in the prompt…(More)”
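
The few-shot trade-off described above can be made concrete with a small sketch. The instruction wording, the worked example and the rough four-characters-per-token estimate are all illustrative assumptions, not the blog’s actual prompts:

```python
def build_few_shot_prompt(examples, question):
    """Prepend prior question-reply pairs to steer the model's style."""
    parts = ["Draft a reply in the house style shown below.\n"]
    for q, a in examples:
        parts.append(f"Question: {q}\nReply: {a}\n")
    parts.append(f"Question: {question}\nReply:")
    return "\n".join(parts)

examples = [
    ("How many schools were built in 2022?",
     "The Ministry completed 4 schools in 2022. Details are at Annex A."),
]
prompt = build_few_shot_prompt(
    examples, "What is the current teacher-student ratio?")

# Rough cost of the added examples, assuming ~4 characters per token:
extra_tokens = sum(len(q) + len(a) for q, a in examples) // 4
print(extra_tokens)
```

Every example molds the output style a little more, but each one is re-sent with every request, which is exactly the token-cost trade-off the post describes.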

The double-edged sword of AI in education


Article by Rose Luckin: “Artificial intelligence (AI) could revolutionize education as profoundly as the internet has already revolutionized our lives. However, our experience with commercial internet platforms gives us pause. Consider how social media algorithms, designed to maximize engagement and ad revenue, have inadvertently promoted divisive content and misinformation, a development at odds with educational goals.

Like the commercialization of the internet, the AI consumerization trend, driven by massive investments across sectors, prioritizes profit over societal and educational benefits. This focus on monetization risks overshadowing crucial considerations about AI’s integration into educational contexts.

The consumerization of AI in education is a double-edged sword. While increasing accessibility, it could also undermine fundamental educational principles and reshape students’ attitudes toward learning. We must advocate for a thoughtful, education-centric approach to AI development that enhances, rather than replaces, human intelligence and recognizes the value of effort in learning.

As generative AI systems for education emerge, technical experts and policymakers have a unique opportunity to ensure their design supports the interests of learners and educators.

Risk 1: Overestimating AI’s intelligence

In essence, learning is not merely an individual cognitive process but a deeply social endeavor, intricately linked to cultural context, language development, and the dynamic relationship between practical experience and theoretical knowledge…(More)”.

The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
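
Python’s standard library can check these restrictions directly. The `robots.txt` content and bot names below are invented for illustration; real publishers name specific crawlers (e.g. AI companies’ user agents) in exactly this format:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: the site bars one AI crawler entirely,
# while other bots are only barred from /private/.
robots_txt = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("ExampleAIBot", "https://example.com/articles/1"))  # False
print(rp.can_fetch("OtherBot", "https://example.com/articles/1"))      # True
print(rp.can_fetch("OtherBot", "https://example.com/private/x"))       # False
```

Note that the protocol is purely advisory: a well-behaved crawler checks `can_fetch` before requesting a page, but nothing technically stops a bot from ignoring the file, which is part of why the study frames this as a crisis of consent rather than of enforcement.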

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.

The Five Stages Of AI Grief


Essay by Benjamin Bratton: “Alignment” and “human-centered AI” are just words representing our hopes and fears related to the sense that AI is out of control — but also to the idea that complex technologies were never under human control to begin with. For reasons more political than perceptive, some insist that “AI” is not even “real,” that it is just math or just an ideological construction of capitalism turning itself into a naturalized fact. Some critics are clearly very angry at the all-too-real prospects of pervasive machine intelligence. Others recognize the reality of AI but are convinced it is something that can be controlled by legislative sessions, policy papers and community workshops. This does not ameliorate the depression felt by still others, who foresee existential catastrophe.

All these reactions may confuse those who see the evolution of machine intelligence, and the artificialization of intelligence itself, as an overdetermined consequence of deeper developments. What to make of these responses?

Sigmund Freud used the term “Copernican” to describe modern decenterings of the human from a place of intuitive privilege. After Nicolaus Copernicus and Charles Darwin, he nominated psychoanalysis as the third such revolution. He also characterized the response to such decenterings as “traumas.”

Trauma brings grief. This is normal. In her 1969 book, “On Death and Dying,” the Swiss psychiatrist Elisabeth Kübler-Ross identified the “five stages of grief”: denial, anger, bargaining, depression and acceptance. Perhaps Copernican Traumas are no different…(More)”.

An Algorithm Told Police She Was Safe. Then Her Husband Killed Her.


Article by Adam Satariano and Roser Toll Pifarré: “Spain has become dependent on an algorithm to combat gender violence, with the software so woven into law enforcement that it is hard to know where its recommendations end and human decision-making begins. At its best, the system has helped police protect vulnerable women and, overall, has reduced the number of repeat attacks in domestic violence cases. But the reliance on VioGén has also resulted in victims whose risk levels were miscalculated being attacked again — sometimes with fatal consequences.

Spain now has 92,000 active cases of gender violence victims who were evaluated by VioGén, with most of them — 83 percent — classified as facing little risk of being hurt by their abuser again. Yet roughly 8 percent of women whom the algorithm found to be at negligible risk and 14 percent at low risk have reported being harmed again, according to Spain’s Interior Ministry, which oversees the system.

At least 247 women have also been killed by their current or former partner since 2007 after being assessed by VioGén, according to government figures. While that is a tiny fraction of gender violence cases, it points to the algorithm’s flaws. The New York Times found that in a judicial review of 98 of those homicides, 55 of the slain women were scored by VioGén as negligible or low risk for repeat abuse…(More)”.

10 profound answers about the math behind AI


Article by Ethan Siegel: “Why do machines learn? Even in the recent past, this would have been a ridiculous question, as machines — i.e., computers — were only capable of executing whatever instructions a human programmer had programmed into them. With the rise of generative artificial intelligence (AI), however, machines truly appear to be gifted with the ability to learn, refining their answers based on continued interactions with both human and non-human users. Large language model-based artificial intelligence programs, such as ChatGPT, Claude, Gemini and more, are now so widespread that they’re replacing traditional tools, including Google searches, in applications all across the world.

How did this come to be? How did we so swiftly come to live in an era where many of us are happy to turn over aspects of our lives that traditionally needed a human expert to a computer program? From financial to medical decisions, from quantum systems to protein folding, and from sorting data to finding signals in a sea of noise, many programs that leverage artificial intelligence (AI) and machine learning (ML) are far superior at these tasks compared with even the greatest human experts.

In his new book, Why Machines Learn: The Elegant Math Behind Modern AI, science writer Anil Ananthaswamy explores all of these aspects and more. I was fortunate enough to get to do a question-and-answer interview with him, and here are the 10 most profound responses he was generous enough to give….(More)”

Mapping the Landscape of AI-Powered Nonprofits


Article by Kevin Barenblat: “Visualize the year 2050. How do you see AI having impacted the world? Whatever you’re picturing… the reality will probably be quite a bit different. Just think about the personal computer. In its early days circa the 1980s, tech companies marketed the devices for the best use cases they could imagine: reducing paperwork, doing math, and keeping track of forgettable things like birthdays and recipes. It was impossible to imagine that decades later, those larger-than-a-toaster devices would be smaller than a Pop-Tart, connect with billions of other devices, and respond to voice and touch.

It can be hard for us to see how new technologies will ultimately be used. The same is true of artificial intelligence. With new use cases popping up every day, we are early in the age of AI. To make sense of all the action, many landscapes have been published to organize the tech stacks and private sector applications of AI. We could not, however, find an overview of how nonprofits are using AI for impact…

AI-powered nonprofits (APNs) are already advancing solutions to many social problems, and Google.org’s recent research brief AI in Action: Accelerating Progress Towards the Sustainable Development Goals shows that AI is driving progress towards all 17 SDGs. Three goals that stand out with especially strong potential to be transformed by AI are SDG 3 (Good Health and Well-Being), SDG 4 (Quality Education), and SDG 13 (Climate Action). As such, this series focuses on how AI-powered nonprofits are transforming the climate, health care, and education sectors…(More)”.