When A.I. Fails the Language Test, Who Is Left Out of the Conversation?


Article by Sara Ruberg: “While the use of A.I. has exploded in the West, much of the rest of the world has been left out of the conversation since most of the technology is trained in English. A.I. experts worry that the language gap could exacerbate technological inequities, and that it could leave many regions and cultures behind.

A delay in access to good technology of even a few years “can potentially lead to a few decades of economic delay,” said Sang Truong, a Ph.D. candidate at the Stanford Artificial Intelligence Laboratory at Stanford University on the team that built and tested a Vietnamese language model against others.

The tests his team ran found that A.I. tools across the board could get facts and diction wrong when working with Vietnamese, likely because it is a “low-resource” language by industry standards, which means that there aren’t sufficient data sets and content available online for the A.I. model to learn from.

Low-resource languages are spoken by tens and sometimes hundreds of millions of people around the world, but they yield less digital data because A.I. tech development and online engagement are centered in the United States and China. Other low-resource languages include Hindi, Bengali and Swahili, as well as lesser-known dialects spoken by smaller populations around the world.

An analysis of top websites by W3Techs, a tech survey company, found that English makes up over 60 percent of the internet’s language data. While English is widely spoken globally, native English speakers make up about 5 percent of the world’s population, according to Ethnologue, a research organization that collects language data. Mandarin and Spanish are other examples of languages with a significant online presence and reliable digital data sets.

Academic institutions, grass-roots organizations and volunteer efforts are playing catch-up to build resources for speakers of languages that aren’t as well represented in the digital landscape.

Lelapa AI, based in Johannesburg, is one such company leading efforts on the African continent. The South African-based start-up is developing multilingual A.I. products for people and businesses in Africa…(More)”.

AI firms will soon exhaust most of the internet’s data


Article by The Economist: “One approach is to focus on data quality rather than quantity. AI labs do not simply train their models on the entire internet. They filter and sequence data to maximise how much their models learn. Naveen Rao of Databricks, an AI firm, says that this is the “main differentiator” between AI models on the market. “True information” about the world obviously matters; so does lots of “reasoning”. That makes academic textbooks, for example, especially valuable. But setting the balance between data sources remains something of a dark art. What is more, the ordering in which the system encounters different types of data matters too. Lump all the data on one topic, like maths, at the end of the training process, and your model may become specialised at maths but forget some other concepts.

These considerations can get even more complex when the data are not just on different subjects but in different forms. In part because of the lack of new textual data, leading models like OpenAI’s GPT-4o and Google’s Gemini are now let loose on image, video and audio files as well as text during their self-supervised learning. Training on video is hardest given how dense with data points video files are. Current models typically look at a subset of frames to simplify things.
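To make the idea of looking at "a subset of frames" concrete, here is a minimal sketch. The article does not say how frames are chosen, so evenly spaced sampling is an assumption made purely for illustration, and the function name `sample_frame_indices` is hypothetical.

```python
def sample_frame_indices(total_frames: int, num_samples: int) -> list[int]:
    """Pick evenly spaced frame indices from a video.

    Assumption: uniform sampling. Real training pipelines may choose
    frames differently; the article does not specify a method.
    """
    if total_frames <= 0 or num_samples <= 0:
        return []
    if num_samples >= total_frames:
        # Fewer frames than requested samples: keep every frame.
        return list(range(total_frames))
    step = total_frames / num_samples
    # Split the video into equal segments and take the middle frame of each.
    return [int(step * i + step / 2) for i in range(num_samples)]


# A 10-second clip at 30 frames per second has 300 frames; keep only 4 of them.
indices = sample_frame_indices(300, 4)
print(indices)
```

Sampling the middle of each segment, rather than its first frame, avoids always landing on the very first frame of the clip. It is one of several reasonable choices.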

Whatever models are used, ownership is increasingly recognised as an issue. The material used in training LLMs is often copyrighted and used without consent from, or payment to, the rights holders. Some AI models peep behind paywalls. Model creators claim this sort of thing falls under the “fair use” exemption in American copyright law. AI models should be allowed to read copyrighted material when they learn, just as humans can, they say. But as Benedict Evans, a technology analyst, has put it, “a difference in scale” can lead to “a difference in principle”…

It is clear that access to more data—whether culled from specialist sources, generated synthetically or provided by human experts—is key to maintaining rapid progress in AI. Like oilfields, the most accessible data reserves have been depleted. The challenge now is to find new ones—or sustainable alternatives…(More)”.

The Risks of Empowering “Citizen Data Scientists”


Article by Reid Blackman and Tamara Sipes: “Until recently, the prevailing understanding of artificial intelligence (AI) and its subset machine learning (ML) was that expert data scientists and AI engineers were the only people that could push AI strategy and implementation forward. That was a reasonable view. After all, data science generally, and AI in particular, is a technical field requiring, among other things, expertise that requires many years of education and training to obtain.

Fast forward to today, however, and the conventional wisdom is rapidly changing. The advent of “auto-ML” — software that provides methods and processes for creating machine learning code — has led to calls to “democratize” data science and AI. The idea is that these tools enable organizations to invite and leverage non-data scientists — say, domain data experts, team members very familiar with the business processes, or heads of various business units — to propel their AI efforts.

In theory, making data science and AI more accessible to non-data scientists (including technologists who are not data scientists) can make a lot of business sense. Centralized and siloed data science units can fail to appreciate the vast array of data the organization has and the business problems that it can solve, particularly with multinational organizations with hundreds or thousands of business units distributed across several continents. Moreover, those in the weeds of business units know the data they have, the problems they’re trying to solve, and can, with training, see how that data can be leveraged to solve those problems. The opportunities are significant.

In short, with great business insight, augmented with auto-ML, can come great analytic responsibility. At the same time, we cannot forget that data science and AI are, in fact, very difficult, and there’s a very long journey from having data to solving a problem. In this article, we’ll lay out the pros and cons of integrating citizen data scientists into your AI strategy and suggest methods for optimizing success and minimizing risks…(More)”.

AI mass surveillance at Paris Olympics


Article by Anne Toomey McKenna: “The 2024 Paris Olympics is drawing the eyes of the world as thousands of athletes and support personnel and hundreds of thousands of visitors from around the globe converge in France. It’s not just the eyes of the world that will be watching. Artificial intelligence systems will be watching, too.

Governments and private companies will be using advanced AI tools and other surveillance tech to conduct pervasive and persistent surveillance before, during and after the Games. The Olympic world stage and international crowds pose increased security risks so significant that in recent years authorities and critics have described the Olympics as the “world’s largest security operations outside of war.”

The French government, hand in hand with the private tech sector, has harnessed that legitimate need for increased security as grounds to deploy technologically advanced surveillance and data gathering tools. Its surveillance plans to meet those risks, including controversial use of experimental AI video surveillance, are so extensive that the country had to change its laws to make the planned surveillance legal.

The plan goes beyond new AI video surveillance systems. According to news reports, the prime minister’s office has negotiated a classified provisional decree permitting the government to significantly ramp up traditional, surreptitious surveillance and information gathering tools for the duration of the Games. These include wiretapping; collecting geolocation, communications and computer data; and capturing greater amounts of visual and audio data…(More)”.

Community consent: neither a ceiling nor a floor


Article by Jasmine McNealy: “The 23andMe breach and the Golden State Killer case are two of the more “flashy” cases, but questions of consent, especially the consent of all of those affected by biodata collection and analysis in more mundane or routine health and medical research projects, are just as important. The communities of people affected have expectations about their privacy and the possible impacts of inferences that could be made about them in data processing systems. Researchers must, then, acquire community consent when attempting to work with networked biodata. 

Several benefits of community consent exist, especially for marginalized and vulnerable populations. These benefits include:

  • Ensuring that information about the research project spreads throughout the community,
  • Removing potential barriers that might be created by resistance from community members,
  • Alleviating the possible concerns of individuals about the perspectives of community leaders, and 
  • Allowing the recruitment of participants using methods most salient to the community.

But community consent does not replace individual consent and limits exist for both community and individual consent. Therefore, within the context of a biorepository, understanding whether community consent might be a ceiling or a floor requires examining governance and autonomy…(More)”.

The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.
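As an illustration of how a robots.txt restriction works in practice, the sketch below uses Python's standard-library `urllib.robotparser` to check whether a crawler may fetch a page. The robots.txt content, the bot name `ExampleAIBot`, and the `example.com` URLs are all hypothetical and are not taken from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: block one named AI crawler from the whole site,
# and keep every other bot out of /private/ only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


ai_bot_ok = is_allowed(ROBOTS_TXT, "ExampleAIBot", "https://example.com/articles/1")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/articles/1")
other_private = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/private/x")

print(ai_bot_ok, other_ok, other_private)
```

The protocol is advisory: a file like this states the site owner's wishes, and it is up to each crawler to check and honor them.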

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.

The Five Stages Of AI Grief


Essay by Benjamin Bratton: “Alignment” toward “human-centered AI” are just words representing our hopes and fears related to how AI feels like it is out of control — but also to the idea that complex technologies were never under human control to begin with. For reasons more political than perceptive, some insist that “AI” is not even “real,” that it is just math or just an ideological construction of capitalism turning itself into a naturalized fact. Some critics are clearly very angry at the all-too-real prospects of pervasive machine intelligence. Others recognize the reality of AI but are convinced it is something that can be controlled by legislative sessions, policy papers and community workshops. This does not ameliorate the depression felt by still others, who foresee existential catastrophe.

All these reactions may confuse those who see the evolution of machine intelligence, and the artificialization of intelligence itself, as an overdetermined consequence of deeper developments. What to make of these responses?

Sigmund Freud used the term “Copernican” to describe modern decenterings of the human from a place of intuitive privilege. After Nicolaus Copernicus and Charles Darwin, he nominated psychoanalysis as the third such revolution. He also characterized the response to such decenterings as “traumas.”

Trauma brings grief. This is normal. In her 1969 book, “On Death and Dying,” the Swiss psychiatrist Elisabeth Kübler-Ross identified the “five stages of grief”: denial, anger, bargaining, depression and acceptance. Perhaps Copernican Traumas are no different…(More)”.

The Department of Everything


Article by Stephen Akey: “How do you find the life expectancy of a California condor? Google it. Or the gross national product of Morocco? Google it. Or the final resting place of Tom Paine? Google it. There was a time, however—not all that long ago—when you couldn’t Google it or ask Siri or whatever cyber equivalent comes next. You had to do it the hard way—by consulting reference books, indexes, catalogs, almanacs, statistical abstracts, and myriad other printed sources. Or you could save yourself all that time and trouble by taking the easiest available shortcut: You could call me.

From 1984 to 1988, I worked in the Telephone Reference Division of the Brooklyn Public Library. My seven or eight colleagues and I spent the days (and nights) answering exactly such questions. Our callers were as various as New York City itself: copyeditors, fact checkers, game show aspirants, journalists, bill collectors, bet settlers, police detectives, students and teachers, the idly curious, the lonely and loquacious, the park bench crazies, the nervously apprehensive. (This last category comprised many anxious patients about to undergo surgery who called us for background checks on their doctors.) There were telephone reference divisions in libraries all over the country, but this being New York City, we were an unusually large one with an unusually heavy volume of calls. And if I may say so, we were one of the best. More than one caller told me that we were a legend in the world of New York magazine publishing…(More)”.

Reliability of U.S. Economic Data Is in Jeopardy, Study Finds


Article by Ben Casselman: “A report says new approaches and increased spending are needed to ensure that government statistics remain dependable and free of political influence.

Federal Reserve officials use government data to help determine when to raise or lower interest rates. Congress and the White House use it to decide when to extend jobless benefits or send out stimulus payments. Investors place billions of dollars worth of bets that are tied to monthly reports on job growth, inflation and retail sales.

But a new study says the integrity of that data is in increasing jeopardy.

The report, issued on Tuesday by the American Statistical Association, concludes that government statistics are reliable right now. But that could soon change, the study warns, citing factors including shrinking budgets, falling survey response rates and the potential for political interference.

The authors — statisticians from George Mason University, the Urban Institute and other institutions — likened the statistical system to physical infrastructure like highways and bridges: vital, but often ignored until something goes wrong.

“We do identify this sort of downward spiral as a threat, and that’s what we’re trying to counter,” said Nancy Potok, who served as chief statistician of the United States from 2017 to 2019 and was one of the report’s authors. “We’re not there yet, but if we don’t do something, that threat could become a reality, and in the not-too-distant future.”

The report, “The Nation’s Data at Risk,” highlights the threats facing statistics produced across the federal government, including data on education, health, crime and demographic trends.

But the risks to economic data are particularly notable because of the attention it receives from policymakers and investors. Most of that data is based on surveys of households or businesses. And response rates to government surveys have plummeted in recent years, as they have for private polls. The response rate to the Current Population Survey — the monthly survey of about 60,000 households that is the basis for the unemployment rate and other labor force statistics — has fallen to about 70 percent in recent months, from nearly 90 percent a decade ago…(More)”.

An Algorithm Told Police She Was Safe. Then Her Husband Killed Her.


Article by Adam Satariano and Roser Toll Pifarré: “Spain has become dependent on an algorithm to combat gender violence, with the software so woven into law enforcement that it is hard to know where its recommendations end and human decision-making begins. At its best, the system has helped police protect vulnerable women and, overall, has reduced the number of repeat attacks in domestic violence cases. But the reliance on VioGén has also resulted in victims, whose risk levels are miscalculated, getting attacked again — sometimes leading to fatal consequences.

Spain now has 92,000 active cases of gender violence victims who were evaluated by VioGén, with most of them — 83 percent — classified as facing little risk of being hurt by their abuser again. Yet roughly 8 percent of women whom the algorithm found to be at negligible risk and 14 percent at low risk have reported being harmed again, according to Spain’s Interior Ministry, which oversees the system.

At least 247 women have also been killed by their current or former partner since 2007 after being assessed by VioGén, according to government figures. While that is a tiny fraction of gender violence cases, it points to the algorithm’s flaws. The New York Times found that in a judicial review of 98 of those homicides, 55 of the slain women were scored by VioGén as negligible or low risk for repeat abuse…(More)”.
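For readers who want the figures above in absolute terms, the short calculation below works through them. It uses only numbers quoted in the excerpt; the variable names are illustrative. The excerpt does not give the split between "negligible" and "low" risk cases, so the 8 percent and 14 percent re-harm rates are not converted into counts here.

```python
# Figures quoted in the excerpt.
active_cases = 92_000     # active cases evaluated by VioGén
little_risk_share = 0.83  # share classified as facing little risk
reviewed_homicides = 98   # homicides covered by the judicial review
scored_low = 55           # of those, scored negligible or low risk

# Approximate number of active cases classified as little risk.
little_risk_cases = round(active_cases * little_risk_share)

# Share of reviewed homicides in which the victim had been scored negligible or low risk.
low_score_pct = round(100 * scored_low / reviewed_homicides, 1)

print(little_risk_cases)  # 76360
print(low_score_pct)      # 56.1
```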