artificial intelligence, DATA

Training Data for the Price of a Sandwich

Article by Stefan Baack: “Common Crawl (henceforth also referred to as CC) is an organization that has been essential to the technological advancements of generative AI, but is largely unknown to the broader public. This California nonprofit with only a handful of employees has crawled billions of web pages since 2008 and it makes this data available without charge via Amazon Web Services (AWS). Because of the enormous size and diversity (in terms of sources and formats) of the data, it has been pivotal as a source for training data for many AI builders. Generative AI in its current form would probably not be possible without Common Crawl, given that the vast majority of data used to train the original model behind OpenAI’s ChatGPT, the generative AI product that set off the current hype, came from it (Brown et al. 2020). The same is true for many models published since then.

Although pivotal, Common Crawl has so far received relatively little attention for its contribution to generative AI…(More)”.

About the Curator

Stefaan Verhulst
