Philanthropy by the Numbers


Essay by Aaron Horvath: “Foundations make grants conditional on demonstrable results. Charities tout the evidentiary basis of their work. And impact consultants play both sides: assisting funders in their pursuit of rational beneficence and helping grantees translate the jumble of reality into orderly, spreadsheet-ready metrics.

Measurable impact has crept into everyday understandings of charity as well. There’s the extensive (often fawning) news coverage of data-crazed billionaire philanthropists, so-called thought leaders exhorting followers to rethink their contributions to charity, and popular books counseling that intuition and sentiment are poor guides for making the world a better place. Putting ideas into action, charity evaluators promote research-backed listings of the most impactful nonprofits. Why give to your local food bank when there’s one in Somerville, Massachusetts, with a better rating?

Over the past thirty years, amid a larger crisis of civic engagement, social isolation, and political alienation, measurable impact has seeped into our civic imagination and become one of the guiding ideals for public-spirited beneficence. And while its proponents do not always agree on how best to achieve or measure the extent of that impact, they have collectively recast civic engagement as objective, pragmatic, and above the fray of politics—a triumph of the head over the heart. But how did we get here? And what happens to our capacity for meaningful collective action when we think of civic life in such depersonalized and quantified terms?…(More)”.

To Whom Does the World Belong?


Essay by Alexander Hartley: “For an idea of the scale of the prize, it’s worth remembering that 90 percent of recent U.S. economic growth, and 65 percent of the value of its largest 500 companies, is already accounted for by intellectual property. By any estimate, AI will vastly increase the speed and scale at which new intellectual products can be minted. The provision of AI services themselves is estimated to become a trillion-dollar market by 2032, but the value of the intellectual property created by those services—all the drug and technology patents; all the images, films, stories, virtual personalities—will eclipse that sum. It is possible that the products of AI may, within my lifetime, come to represent a substantial portion of all the world’s financial value.

In this light, the question of ownership takes on its true scale, revealing itself as a version of Bertolt Brecht’s famous query: To whom does the world belong?


Questions of AI authorship and ownership can be divided into two broad types. One concerns the vast troves of human-authored material fed into AI models as part of their “training” (the process by which their algorithms “learn” from data). The other concerns ownership of what AIs produce. Call these, respectively, the input and output problems.

So far, attention—and lawsuits—have clustered around the input problem. The basic business model for LLMs relies on the mass appropriation of human-written text, and there simply isn’t anywhere near enough in the public domain. OpenAI hasn’t been very forthcoming about its training data, but GPT-4 was reportedly trained on around thirteen trillion “tokens,” roughly the equivalent of ten trillion words. This text is drawn in large part from online repositories known as “crawls,” which scrape the internet for troves of text from news sites, forums, and other sources. Fully aware that vast data scraping is legally untested—to say the least—developers charged ahead anyway, resigning themselves to litigating the issue in retrospect. Lawyer Peter Schoppert has called the training of LLMs without permission the industry’s “original sin”—to be added, we might say, to the technology’s mind-boggling consumption of energy and water in an overheating planet. (In September, Bloomberg reported that plans for new gas-fired power plants have exploded as energy companies are “racing to meet a surge in demand from power-hungry AI data centers.”)…(More)”.

Beyond checking a box: how a social licence can help communities benefit from data reuse and AI


Article by Stefaan Verhulst and Peter Addo: “In theory, consent offers a mechanism to reduce power imbalances. In reality, existing consent mechanisms are limited and, in many respects, archaic, based on binary distinctions – typically presented in check-the-box forms that most websites use to ask you to register for marketing e-mails – that fail to appreciate the nuance and context-sensitive nature of data reuse. Consent today generally means individual consent, a notion that overlooks the broader needs of communities and groups.

While we understand the need to safeguard information about an individual such as, say, their health status, this information can help address or even prevent societal health crises. Individualised notions of consent fail to consider the potential public good of reusing individual data responsibly. This makes them particularly problematic in societies that have more collective orientations, where prioritising individual choices could disrupt the social fabric.

The notion of a social licence, which has its roots in the 1990s within the extractive industries, refers to the collective acceptance of an activity, such as data reuse, based on its perceived alignment with community values and interests. Social licences go beyond the priorities of individuals and help balance the risks of data misuse and missed use (for example, the risks of violating privacy vs. neglecting to use private data for public good). Social licences permit a broader notion of consent that is dynamic, multifaceted and context-sensitive.

Policymakers, citizens, health providers, think tanks, interest groups and private industry must accept the concept of a social licence before it can be established. The goal for all stakeholders is to establish widespread consensus on community norms and an acceptable balance of social risk and opportunity.

Community engagement can create a consensus-based foundation for preferences and expectations concerning data reuse. Engagement could take place via dedicated “data assemblies” or community deliberations about data reuse for particular purposes under particular conditions. The process would need to involve voices as representative as possible of the different parties involved, and include those that are traditionally marginalised or silenced…(More)”.

A linkless internet


Essay by Collin Jennings: “..But now Google and other websites are moving away from relying on links in favour of artificial intelligence chatbots. Considered as preserved trails of connected ideas, links make sense as early victims of the AI revolution since large language models (LLMs) such as ChatGPT, Google’s Gemini and others abstract the information represented online and present it in source-less summaries. We are at a moment in the history of the web in which the link itself – the countless connections made by website creators, the endless tapestry of ideas woven together throughout the web – is in danger of going extinct. So it’s pertinent to ask: how did links come to represent information in the first place? And what’s at stake in the movement away from links toward AI chat interfaces?

To answer these questions, we need to go back to the 17th century, when writers and philosophers developed the theory of mind that ultimately inspired early hypertext plans. In this era, prominent philosophers, including Thomas Hobbes and John Locke, debated the extent to which a person controls the succession of ideas that appears in her mind. They posited that the succession of ideas reflects the interaction between the data received from the senses and one’s mental faculties – reason and imagination. Subsequently, David Hume argued that all successive ideas are linked by association. He enumerated three kinds of associative connections among ideas: resemblance, contiguity, and cause and effect. In An Enquiry Concerning Human Understanding (1748), Hume offers examples of each relationship:

A picture naturally leads our thoughts to the original: the mention of one apartment in a building naturally introduces an enquiry or discourse concerning the others: and if we think of a wound, we can scarcely forbear reflecting on the pain which follows it.

The mind follows connections found in the world. Locke and Hume believed that all human knowledge comes from experience, and so they had to explain how the mind receives, processes and stores external data. They often reached for media metaphors to describe the relationship between the mind and the world. Locke compared the mind to a blank tablet, a cabinet and a camera obscura. Hume relied on the language of printing to distinguish between the vivacity of impressions imprinted upon one’s senses and the ideas recalled in the mind…(More)”.

Harvard Is Releasing a Massive Free AI Training Dataset Funded by OpenAI and Microsoft


Article by Kate Knibbs: “Harvard University announced Thursday it’s releasing a high-quality dataset of nearly 1 million public-domain books that could be used by anyone to train large language models and other AI tools. The dataset was created by Harvard’s newly formed Institutional Data Initiative with funding from both Microsoft and OpenAI. It contains books scanned as part of the Google Books project that are no longer protected by copyright.

Around five times the size of the notorious Books3 dataset that was used to train AI models like Meta’s Llama, the Institutional Data Initiative’s database spans genres, decades, and languages, with classics from Shakespeare, Charles Dickens, and Dante included alongside obscure Czech math textbooks and Welsh pocket dictionaries. Greg Leppert, executive director of the Institutional Data Initiative, says the project is an attempt to “level the playing field” by giving the general public, including small players in the AI industry and individual researchers, access to the sort of highly-refined and curated content repositories that normally only established tech giants have the resources to assemble. “It’s gone through rigorous review,” he says…(More)”.

How Years of Reddit Posts Have Made the Company an AI Darling


Article by Sarah E. Needleman: “Artificial-intelligence companies were one of Reddit’s biggest frustrations last year. Now they are a key source of growth for the social-media platform. 

These companies have an insatiable appetite for online data to train their models and display content in an easy-to-digest format. In mid-2023, Reddit, a social-media veteran and IPO newbie, turned off the spigot and began charging some businesses for access to its data. 

It turns out that Reddit’s ever-growing 19-year warehouse of user commentary makes it an attractive resource for AI companies. The platform recently reported its first quarterly profit as a publicly traded company, thanks partly to data-licensing deals it made in the past year with OpenAI and Google.

Reddit Chief Executive and co-founder Steve Huffman has said the company had to stop giving away its valuable data to the world’s largest companies for free. 

“It is an arms race,” he said at The Wall Street Journal’s Tech Live conference in October. “But we’re in talks with just about everybody, so we’ll see where these things land.”

Reddit’s huge amount of data works well for AI companies because it is organized by topics and uses a voting system instead of an algorithm to sort content quality, and because people’s posts tend to be candid.

For the first nine months of 2024, Reddit’s revenue category that includes licensing grew to $81.6 million from $12.3 million a year earlier.

While data-licensing revenue remains dwarfed by Reddit’s core advertising sales, the new category’s rapid growth reveals a potential lucrative business line with relatively high margins.

Diversifying away from a reliance on advertising, while tapping into an AI-adjacent market, has also made Reddit attractive to investors who are searching for new exposure to the latest technology boom. Reddit’s stock has more than doubled in the past three months.

The source of Reddit’s newfound wealth is the burgeoning market for AI-useful data. Reddit’s willingness to sell its data to AI outfits makes it stand out, because there is only a finite amount of data available for AI companies to gobble up for free or purchase. Some executives and researchers say the industry’s need for high-quality text could outstrip supply within two years, potentially slowing AI’s development…(More)”.

This is where the data to build AI comes from


Article by Melissa Heikkilä and Stephanie Arnett: “AI is all about data. Reams and reams of data are needed to train algorithms to do what we want, and what goes into the AI models determines what comes out. But here’s the problem: AI developers and researchers don’t really know much about the sources of the data they are using. AI’s data collection practices are immature compared with the sophistication of AI model development. Massive data sets often lack clear information about what is in them and where it came from. 

The Data Provenance Initiative, a group of over 50 researchers from both academia and industry, wanted to fix that. They wanted to know, very simply: Where does the data to build AI come from? They audited nearly 4,000 public data sets spanning over 600 languages, 67 countries, and three decades. The data came from 800 unique sources and nearly 700 organizations. 

Their findings, shared exclusively with MIT Technology Review, show a worrying trend: AI’s data practices risk concentrating power overwhelmingly in the hands of a few dominant technology companies. 

In the early 2010s, data sets came from a variety of sources, says Shayne Longpre, a researcher at MIT who is part of the project. 

It came not just from encyclopedias and the web, but also from sources such as parliamentary transcripts, earning calls, and weather reports. Back then, AI data sets were specifically curated and collected from different sources to suit individual tasks, Longpre says.

Then transformers, the architecture underpinning language models, were invented in 2017, and the AI sector started seeing performance get better the bigger the models and data sets were. Today, most AI data sets are built by indiscriminately hoovering material from the internet. Since 2018, the web has been the dominant source for data sets used in all media, such as audio, images, and video, and a gap between scraped data and more curated data sets has emerged and widened.

“In foundation model development, nothing seems to matter more for the capabilities than the scale and heterogeneity of the data and the web,” says Longpre. The need for scale has also boosted the use of synthetic data massively.

The past few years have also seen the rise of multimodal generative AI models, which can generate videos and images. Like large language models, they need as much data as possible, and the best source for that has become YouTube. 

For video models, as you can see in this chart, over 70% of data for both speech and image data sets comes from one source.

This could be a boon for Alphabet, Google’s parent company, which owns YouTube. Whereas text is distributed across the web and controlled by many different websites and platforms, video data is extremely concentrated in one platform.

“It gives a huge concentration of power over a lot of the most important data on the web to one company,” says Longpre…(More)”.

How cities are reinventing the public-private partnership − 4 lessons from around the globe


Article by Debra Lam: “Cities tackle a vast array of responsibilities – from building transit networks to running schools – and sometimes they can use a little help. That’s why local governments have long teamed up with businesses in so-called public-private partnerships. Historically, these arrangements have helped cities fund big infrastructure projects such as bridges and hospitals.

However, our analysis and research show an emerging trend with local governments engaged in private-sector collaborations – what we have come to describe as “community-centered, public-private partnerships,” or CP3s. Unlike traditional public-private partnerships, CP3s aren’t just about financial investments; they leverage relationships and trust. And they’re about more than just building infrastructure; they’re about building resilient and inclusive communities.

As the founding executive director of the Partnership for Inclusive Innovation, based out of the Georgia Institute of Technology, I’m fascinated with CP3s. And while not all CP3s are successful, when done right they offer local governments a powerful tool to navigate the complexities of modern urban life.

Together with international climate finance expert Andrea Fernández of the urban climate leadership group C40, we analyzed community-centered, public-private partnerships across the world and put together eight case studies. Together, they offer valuable insights into how cities can harness the power of CP3s.

4 keys to success

Although we looked at partnerships forged in different countries and contexts, we saw several elements emerge as critical to success over and over again.

• 1. Clear mission and vision: It’s essential to have a mission that resonates with everyone involved. Ruta N in Medellín, Colombia, for example, transformed the city into a hub of innovation, attracting 471 technology companies and creating 22,500 jobs.

This vision wasn’t static. It evolved in response to changing local dynamics, including leadership priorities and broader global trends. However, the core mission of entrepreneurship, investment and innovation remained clear and was embraced by all key stakeholders, driving the partnership forward.

 2. Diverse and engaged partners: Successful CP3s rely on the active involvement of a wide range of partners, each bringing their unique expertise and resources to the table. In the U.K., for example, the Hull net-zero climate initiative featured a partnership that included more than 150 companies, many small and medium-size. This diversity of partners was crucial to the initiative’s success because they could leverage resources and share risks, enabling it to address complex challenges from multiple angles.

Similarly, Malaysia’s Think City engaged community-based organizations and vulnerable populations in its Penang climate adaptation program. This ensured that the partnership was inclusive and responsive to the needs of all citizens…(More)”.

Why probability probably doesn’t exist (but it is useful to act like it does)


Article by David Spiegelhalter: “Life is uncertain. None of us know what is going to happen. We know little of what has happened in the past, or is happening now outside our immediate experience. Uncertainty has been called the ‘conscious awareness of ignorance’1 — be it of the weather tomorrow, the next Premier League champions, the climate in 2100 or the identity of our ancient ancestors.

In daily life, we generally express uncertainty in words, saying an event “could”, “might” or “is likely to” happen (or have happened). But uncertain words can be treacherous. When, in 1961, the newly elected US president John F. Kennedy was informed about a CIA-sponsored plan to invade communist Cuba, he commissioned an appraisal from his military top brass. They concluded that the mission had a 30% chance of success — that is, a 70% chance of failure. In the report that reached the president, this was rendered as “a fair chance”. The Bay of Pigs invasion went ahead, and was a fiasco. There are now established scales for converting words of uncertainty into rough numbers. Anyone in the UK intelligence community using the term ‘likely’, for example, should mean a chance of between 55% and 75% (see go.nature.com/3vhu5zc).

Attempts to put numbers on chance and uncertainty take us into the mathematical realm of probability, which today is used confidently in any number of fields. Open any science journal, for example, and you’ll find papers liberally sprinkled with P values, confidence intervals and possibly Bayesian posterior distributions, all of which are dependent on probability.

And yet, any numerical probability, I will argue — whether in a scientific paper, as part of weather forecasts, predicting the outcome of a sports competition or quantifying a health risk — is not an objective property of the world, but a construction based on personal or collective judgements and (often doubtful) assumptions. Furthermore, in most circumstances, it is not even estimating some underlying ‘true’ quantity. Probability, indeed, can only rarely be said to ‘exist’ at all…(More)”.

The AI revolution is running out of data. What can researchers do?


Article by Nicola Jones: “The Internet is a vast ocean of human knowledge, but it isn’t infinite. And artificial intelligence (AI) researchers have nearly sucked it dry.

The past decade of explosive improvement in AI has been driven in large part by making neural networks bigger and training them on ever-more data. This scaling has proved surprisingly effective at making large language models (LLMs) — such as those that power the chatbot ChatGPT — both more capable of replicating conversational language and of developing emergent properties such as reasoning. But some specialists say that we are now approaching the limits of scaling. That’s in part because of the ballooning energy requirements for computing. But it’s also because LLM developers are running out of the conventional data sets used to train their models.

A prominent study1 made headlines this year by putting a number on this problem: researchers at Epoch AI, a virtual research institute, projected that, by around 2028, the typical size of data set used to train an AI model will reach the same size as the total estimated stock of public online text. In other words, AI is likely to run out of training data in about four years’ time (see ‘Running out of data’). At the same time, data owners — such as newspaper publishers — are starting to crack down on how their content can be used, tightening access even more. That’s causing a crisis in the size of the ‘data commons’, says Shayne Longpre, an AI researcher at the Massachusetts Institute of Technology in Cambridge who leads the Data Provenance Initiative, a grass-roots organization that conducts audits of AI data sets.

The imminent bottleneck in training data could be starting to pinch. “I strongly suspect that’s already happening,” says Longpre…(More)”

Running out of data: Chart showing projections of the amount of text data used to train large language models and the amount of available text on the Internet, suggesting that by 2028, developers will be using data sets that match the total amount of text that is available.