The AI data scraping challenge:  How can we proceed responsibly?

Article by Lee Tiedrich: “Society faces an urgent and complex artificial intelligence (AI) data scraping challenge.  Left unsolved, it could threaten responsible AI innovation.  Data scraping refers to using web crawlers or other means to obtain data from third-party websites or social media properties.  Today’s large language models (LLMs) depend on vast amounts of scraped data for training and potentially other purposes.  Scraped data can include facts, creative content, computer code, personal information, brands, and just about anything else.  At least some LLM operators directly scrape data from third-party sites.  Common Crawl, LAION, and other sites make scraped data readily accessible.  Meanwhile, Bright Data and others offer scraped data for a fee.

In addition to fueling commercial LLMs, scraped data can provide researchers with much-needed data to advance social good.  For instance, Environmental Journal explains how scraped data enhances sustainability analysis.  Nature reports that scraped data improves research about opioid-related deaths.  Training data in different languages can help make AI more accessible for users in Africa and other underserved regions.  Access to training data can even advance the OECD AI Principles by improving safety and reducing bias and other harms, particularly when such data is suitable for the AI system’s intended purpose…(More)”.