The Data That Powers A.I. Is Disappearing Fast


Article by Kevin Roose: “For years, the people building powerful artificial intelligence systems have used enormous troves of text, images and videos pulled from the internet to train their models.

Now, that data is drying up.

Over the past year, many of the most important web sources used for training A.I. models have restricted the use of their data, according to a study published this week by the Data Provenance Initiative, an M.I.T.-led research group.

The study, which looked at 14,000 web domains that are included in three commonly used A.I. training data sets, discovered an “emerging crisis in consent,” as publishers and online platforms have taken steps to prevent their data from being harvested.

The researchers estimate that in the three data sets — called C4, RefinedWeb and Dolma — 5 percent of all data, and 25 percent of data from the highest-quality sources, has been restricted. Those restrictions are set up through the Robots Exclusion Protocol, a decades-old method for website owners to prevent automated bots from crawling their pages using a file called robots.txt.

The study also found that as much as 45 percent of the data in one set, C4, had been restricted by websites’ terms of service.

“We’re seeing a rapid decline in consent to use data across the web that will have ramifications not just for A.I. companies, but for researchers, academics and noncommercial entities,” said Shayne Longpre, the study’s lead author, in an interview.

Data is the main ingredient in today’s generative A.I. systems, which are fed billions of examples of text, images and videos. Much of that data is scraped from public websites by researchers and compiled in large data sets, which can be downloaded and freely used, or supplemented with data from other sources…(More)”.
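
As a rough illustration of the Robots Exclusion Protocol mentioned in the excerpt above, the sketch below uses Python's standard `urllib.robotparser` module to check whether a crawler may fetch a page. The robots.txt contents, the `ExampleAIBot` user agent, the example.com URLs and the `can_crawl` helper are all made up for illustration; none of them come from the study.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one AI crawler is blocked from the whole site,
# while every other bot is blocked only from /private/.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Disallow: /private/
"""


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() accepts the file as an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("ExampleAIBot", "SomeSearchBot"):
        for path in ("/articles/1", "/private/notes"):
            url = "https://example.com" + path
            print(agent, path, can_crawl(ROBOTS_TXT, agent, url))
```

Compliance with robots.txt is voluntary on the crawler's side, which is part of why the study treats these restrictions as signals of consent rather than as technical barriers.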

The Five Stages Of AI Grief


Essay by Benjamin Bratton: “Alignment” toward “human-centered AI” are just words representing our hopes and fears related to how AI feels like it is out of control — but also to the idea that complex technologies were never under human control to begin with. For reasons more political than perceptive, some insist that “AI” is not even “real,” that it is just math or just an ideological construction of capitalism turning itself into a naturalized fact. Some critics are clearly very angry at the all-too-real prospects of pervasive machine intelligence. Others recognize the reality of AI but are convinced it is something that can be controlled by legislative sessions, policy papers and community workshops. This does not ameliorate the depression felt by still others, who foresee existential catastrophe.

All these reactions may confuse those who see the evolution of machine intelligence, and the artificialization of intelligence itself, as an overdetermined consequence of deeper developments. What to make of these responses?

Sigmund Freud used the term “Copernican” to describe modern decenterings of the human from a place of intuitive privilege. After Nicolaus Copernicus and Charles Darwin, he nominated psychoanalysis as the third such revolution. He also characterized the response to such decenterings as “traumas.”

Trauma brings grief. This is normal. In her 1969 book, “On Death and Dying,” the Swiss psychiatrist Elisabeth Kübler-Ross identified the “five stages of grief”: denial, anger, bargaining, depression and acceptance. Perhaps Copernican Traumas are no different…(More)”.

An Algorithm Told Police She Was Safe. Then Her Husband Killed Her.


Article by Adam Satariano and Roser Toll Pifarré: “Spain has become dependent on an algorithm to combat gender violence, with the software so woven into law enforcement that it is hard to know where its recommendations end and human decision-making begins. At its best, the system has helped police protect vulnerable women and, overall, has reduced the number of repeat attacks in domestic violence cases. But the reliance on VioGén has also resulted in victims, whose risk levels are miscalculated, getting attacked again — sometimes leading to fatal consequences.

Spain now has 92,000 active cases of gender violence victims who were evaluated by VioGén, with most of them — 83 percent — classified as facing little risk of being hurt by their abuser again. Yet roughly 8 percent of women who the algorithm found to be at negligible risk and 14 percent at low risk have reported being harmed again, according to Spain’s Interior Ministry, which oversees the system.

At least 247 women have also been killed by their current or former partner since 2007 after being assessed by VioGén, according to government figures. While that is a tiny fraction of gender violence cases, it points to the algorithm’s flaws. The New York Times found that in a judicial review of 98 of those homicides, 55 of the slain women were scored by VioGén as negligible or low risk for repeat abuse…(More)”.

10 profound answers about the math behind AI


Article by Ethan Siegel: “Why do machines learn? Even in the recent past, this would have been a ridiculous question, as machines — i.e., computers — were only capable of executing whatever instructions a human programmer had programmed into them. With the rise of generative AI, or artificial intelligence, however, machines truly appear to be gifted with the ability to learn, refining their answers based on continued interactions with both human and non-human users. Large language model-based artificial intelligence programs, such as ChatGPT, Claude, Gemini and more, are now so widespread that they’re replacing traditional tools, including Google searches, in applications all across the world.

How did this come to be? How did we so swiftly come to live in an era where many of us are happy to turn over aspects of our lives that traditionally needed a human expert to a computer program? From financial to medical decisions, from quantum systems to protein folding, and from sorting data to finding signals in a sea of noise, many programs that leverage artificial intelligence (AI) and machine learning (ML) are far superior at these tasks compared with even the greatest human experts.

In his new book, Why Machines Learn: The Elegant Math Behind Modern AI, science writer Anil Ananthaswamy explores all of these aspects and more. I was fortunate enough to get to do a question-and-answer interview with him, and here are the 10 most profound responses he was generous enough to give…(More)”.

Mapping the Landscape of AI-Powered Nonprofits


Article by Kevin Barenblat: “Visualize the year 2050. How do you see AI having impacted the world? Whatever you’re picturing… the reality will probably be quite a bit different. Just think about the personal computer. In its early days circa the 1980s, tech companies marketed the devices for the best use cases they could imagine: reducing paperwork, doing math, and keeping track of forgettable things like birthdays and recipes. It was impossible to imagine that decades later, these larger-than-a-toaster devices would shrink to smaller than a Pop-Tart, connect with billions of other devices, and respond to voice and touch.

It can be hard for us to see how new technologies will ultimately be used. The same is true of artificial intelligence. With new use cases popping up every day, we are early in the age of AI. To make sense of all the action, many landscapes have been published to organize the tech stacks and private sector applications of AI. We could not, however, find an overview of how nonprofits are using AI for impact…

AI-powered nonprofits (APNs) are already advancing solutions to many social problems, and Google.org’s recent research brief AI in Action: Accelerating Progress Towards the Sustainable Development Goals shows that AI is driving progress towards all 17 SDGs. Three goals that stand out with especially strong potential to be transformed by AI are SDG 3 (Good Health and Well-Being), SDG 4 (Quality Education), and SDG 13 (Climate Action). As such, this series focuses on how AI-powered nonprofits are transforming the climate, health care, and education sectors…(More)”.

Diversity in Artificial Intelligence Conferences


Report by the divinAI (Diversity in Artificial Intelligence) Project: “…provides a set of diversity indicators for seven core artificial intelligence (AI) conferences from 2007 to 2023: the International Joint Conference on Artificial Intelligence (IJCAI), the Annual Association for the Advancement of Artificial Intelligence (AAAI) Conference, the International Conference on Machine Learning (ICML), Neural Information Processing Systems (NeurIPS) Conference, the Association for Computing Machinery (ACM) Recommender Systems (RecSys) Conference, the European Conference on Artificial Intelligence (ECAI) and the European Conference on Machine Learning/Practice of Knowledge Discovery in Databases (ECML/PKDD).

We observe that, in general, Conference Diversity Index (CDI) values are still low for the selected conferences, although showing a slight temporal improvement thanks to diversity initiatives in the AI field. We also note slight differences between conferences, with RecSys showing comparatively higher diversity indicators, followed by general AI conferences (IJCAI, ECAI and AAAI). The selected Machine Learning conferences NeurIPS and ICML seem to provide lower values for diversity indicators.

Regarding the different dimensions of diversity, gender diversity reflects a low proportion of female authors in all considered conferences, even given current gender diversity efforts in the field, which is in line with the low presence of women in technological fields. In terms of country distribution, we observe a notable presence of researchers from the EU, US and China in the selected conferences, where the presence of Chinese authors has increased in the last few years. Regarding institutions, universities and research centers or institutes play a central role in the AI scientific conferences under analysis, and the presence of industry seems to be more notable in machine learning conferences. An online dashboard that allows exploration and reproducibility complements the report…(More)”.

AI: a transformative force in maternal healthcare


Article by Afifa Waheed: “Artificial intelligence (AI) and robotics have enormous potential in healthcare and are quickly shifting the landscape – emerging as a transformative force. They offer a new dimension to the way healthcare professionals approach disease diagnosis, treatment and monitoring. AI is being used in healthcare to help diagnose patients, for drug discovery and development, to improve physician-patient communication, to transcribe voluminous medical documents, and to analyse genomics and genetics. Labs are conducting research work faster than ever before, work that otherwise would have taken decades without the assistance of AI. AI-driven research in life sciences has included applications looking to address broad-based areas, such as diabetes, cancer, chronic kidney disease and maternal health.

In addition to improving knowledge of and access to postnatal and neonatal care, AI can predict the risk of adverse events in antenatal and postnatal women and in neonatal care. It can be trained to identify those at risk of adverse events, by using patients’ health information such as nutrition status, age, existing health conditions and lifestyle factors.

AI can further be used to improve access to care for women located in rural areas with a lack of trained professionals – AI-enabled ultrasound can assist front-line workers with image interpretation for a comprehensive set of obstetric measurements, increasing access to quality early foetal ultrasound scans. The use of AI assistants and chatbots can also improve pregnant mothers’ experience by helping them find available physicians, schedule appointments and even answer some patient questions…

Many healthcare professionals I have spoken to emphasised that pre-existing conditions such as high blood pressure that leads to preeclampsia, iron deficiency, cardiovascular disease, age-related issues for those over 35, various other existing health conditions, and failure in the progress of labour that might lead to a Caesarean section (C-section), could all cause maternal deaths. Training AI models to detect these conditions early and accurately could prove beneficial. AI systems can leverage advanced algorithms, machine learning (ML) techniques, and predictive models to enhance decision-making, optimise healthcare delivery, and ultimately improve patient outcomes in foeto-maternal health…(More)”.

Gen AI: too much spend, too little benefit?


Article by Jason Koebler: “Investment giant Goldman Sachs published a research paper about the economic viability of generative AI which notes that there is “little to show for” the huge amount of spending on generative AI infrastructure and questions “whether this large spend will ever pay off in terms of AI benefits and returns.” 

The paper, called “Gen AI: too much spend, too little benefit?” is based on a series of interviews with Goldman Sachs economists and researchers, MIT professor Daron Acemoglu, and infrastructure experts. The paper ultimately questions whether generative AI will ever become the transformative technology that Silicon Valley and large portions of the stock market are currently betting on, but says investors may continue to get rich anyway. “Despite these concerns and constraints, we still see room for the AI theme to run, either because AI starts to deliver on its promise, or because bubbles take a long time to burst,” the paper notes. 

Goldman Sachs researchers also say that AI optimism is driving large growth in stocks like Nvidia and other S&P 500 companies (the largest companies in the stock market), but say that the stock price gains we’ve seen are based on the assumption that generative AI is going to lead to higher productivity (which necessarily means automation, layoffs, lower labor costs, and higher efficiency). These stock gains are already baked in, Goldman Sachs argues in the paper: “Although the productivity pick-up that AI promises could benefit equities via higher profit growth, we find that stocks often anticipate higher productivity growth before it materializes, raising the risk of overpaying. And using our new long-term return forecasting framework, we find that a very favorable AI scenario may be required for the S&P 500 to deliver above-average returns in the coming decade.”…(More)”.

The era of predictive AI is almost over


Essay by Dean W. Ball: “Artificial intelligence is a Rorschach test. When OpenAI’s GPT-4 was released in March 2023, Microsoft researchers triumphantly, and prematurely, announced that it possessed “sparks” of artificial general intelligence. Cognitive scientist Gary Marcus, on the other hand, argued that Large Language Models like GPT-4 are nowhere close to the loosely defined concept of AGI. Indeed, Marcus is skeptical of whether these models “understand” anything at all. They “operate over ‘fossilized’ outputs of human language,” he wrote in a 2023 paper, “and seem capable of implementing some automatic computations pertaining to distributional statistics, but are incapable of understanding due to their lack of generative world models.” The “fossils” to which Marcus refers are the models’ training data — these days, something close to all the text on the Internet.

This notion — that LLMs are “just” next-word predictors based on statistical models of text — is so common now as to be almost a trope. It is used, both correctly and incorrectly, to explain the flaws, biases, and other limitations of LLMs. Most importantly, it is used by AI skeptics like Marcus to argue that there will soon be diminishing returns from further LLM development: We will get better and better statistical approximations of existing human knowledge, but we are not likely to see another qualitative leap toward “general intelligence.”

There are two problems with this deflationary view of LLMs. The first is that next-word prediction, at sufficient scale, can lead models to capabilities that no human designed or even necessarily intended — what some call “emergent” capabilities. The second problem is that increasingly — and, ironically, starting with ChatGPT — language models employ techniques that combust the notion of pure next-word prediction of Internet text…(More)”
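
To make the "next-word predictor based on statistical models of text" idea in the excerpt above concrete, here is a deliberately tiny sketch: a bigram model that predicts the word most often seen after a given word. The corpus and the function names are invented for illustration. Real LLMs use neural networks over subword tokens rather than raw word counts, so this shows only the bare statistical intuition, not how any production model works.

```python
from collections import Counter, defaultdict
from typing import Dict, Optional


def train_bigram(text: str) -> Dict[str, Counter]:
    """Count, for every word, how often each other word directly follows it."""
    words = text.lower().split()
    counts: Dict[str, Counter] = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def predict_next(counts: Dict[str, Counter], word: str) -> Optional[str]:
    """Return the most frequent follower of word, or None if it was never followed."""
    followers = counts.get(word.lower())
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Toy corpus (an assumption for this sketch, not real training data).
CORPUS = "the cat sat on the mat and the cat slept"
model = train_bigram(CORPUS)

if __name__ == "__main__":
    print(predict_next(model, "the"))    # "cat" follows "the" twice, "mat" once
    print(predict_next(model, "slept"))  # last word of the corpus, so no follower
```

A model like this can only reproduce patterns it has already counted, which is the intuition behind the deflationary view that the essay goes on to dispute.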

Scaling Synthetic Data Creation with 1,000,000,000 Personas


Paper by Xin Chan, et al: “We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub — a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world’s total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub’s use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development…(More)”.
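
One way to picture the persona-driven approach described in the abstract above is as pairing each persona description with a data synthesis task before the prompt is sent to an LLM. The sketch below is written under that assumption. The personas, the task list, the template wording and the `build_prompts` helper are all hypothetical and are not taken from the paper or from Persona Hub, and no model is actually called.

```python
from itertools import product
from typing import List

# Hypothetical persona descriptions; Persona Hub itself is not used here.
PERSONAS = [
    "a high school chemistry teacher",
    "a long-haul truck driver",
    "a pediatric nurse",
]

# Hypothetical kinds of synthetic data to request.
TASKS = ["math word problem", "user instruction"]

# Assumed prompt wording, for illustration only.
TEMPLATE = "Write a {task} that {persona} would plausibly come up with."


def build_prompts(personas: List[str], tasks: List[str]) -> List[str]:
    """Combine every persona with every task into one synthesis prompt each."""
    return [
        TEMPLATE.format(task=task, persona=persona)
        for persona, task in product(personas, tasks)
    ]


if __name__ == "__main__":
    for prompt in build_prompts(PERSONAS, TASKS):
        print(prompt)
```

The number of prompts grows with the number of personas multiplied by the number of tasks, which gives a sense of why a very large persona collection could yield varied data at scale.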