The AI data scraping challenge: How can we proceed responsibly?

Article by Lee Tiedrich: “Society faces an urgent and complex artificial intelligence (AI) data scraping challenge. Left unsolved, it could threaten responsible AI innovation. Data scraping refers to using web crawlers or other means to obtain data from third-party websites or social media properties. Today’s large language models (LLMs) depend on vast amounts of scraped data for training and potentially other purposes. Scraped data can include facts, creative content, computer code, personal information, brands, and just about anything else. At least some LLM operators directly scrape data from third-party sites. Common Crawl, LAION, and other sites make scraped data readily accessible. Meanwhile, Bright Data and others offer scraped data for a fee.

In addition to fueling commercial LLMs, scraped data can provide researchers with much-needed data to advance social good. For instance, Environmental Journal explains how scraped data enhances sustainability analysis. Nature reports that scraped data improves research about opioid-related deaths. Training data in different languages can help make AI more accessible for users in Africa and other underserved regions. Access to training data can even advance the OECD AI Principles by improving safety and reducing bias and other harms, particularly when such data is suitable for the AI system’s intended purpose…(More)”.

How to contribute:

Did you come across – or create – a compelling project/report/book/app at the leading edge of innovation in governance?

Share it with us at info@thelivinglib.org so that we can add it to the Collection!

About the Curator

Stefaan Verhulst
