Training Data for the Price of a Sandwich


Article by Stefan Baack: “Common Crawl (henceforth also referred to as CC) is an organization that has been essential to the technological advancements of generative AI, but is largely unknown to the broader public. This California nonprofit with only a handful of employees has crawled billions of web pages since 2008 and it makes this data available without charge via Amazon Web Services (AWS). Because of the enormous size and diversity (in terms of sources and formats) of the data, it has been pivotal as a source for training data for many AI builders. Generative AI in its current form would probably not be possible without Common Crawl, given that the vast majority of data used to train the original model behind OpenAI’s ChatGPT, the generative AI product that set off the current hype, came from it (Brown et al. 2020). The same is true for many models published since then.

Although pivotal, Common Crawl has so far received relatively little attention for its contribution to generative AI…(More)”.