The Great Scrape: The Clash Between Scraping and Privacy


Paper by Daniel J. Solove and Woodrow Hartzog: “Artificial intelligence (AI) systems depend on massive quantities of data, often gathered by “scraping” – the automated extraction of large amounts of data from the internet. A great deal of scraped data is about people. This personal data provides the grist for AI tools such as facial recognition, deep fakes, and generative AI. Although scraping enables web searching, archival, and meaningful scientific research, scraping for AI can also be objectionable or even harmful to individuals and society.

Organizations are scraping at an escalating pace and scale, even though many privacy laws are seemingly incongruous with the practice. In this Article, we contend that scraping must undergo a serious reckoning with privacy law. Scraping violates nearly all of the key principles in privacy laws, including fairness; individual rights and control; transparency; consent; purpose specification and secondary use restrictions; data minimization; onward transfer; and data security. With scraping, data protection laws built around these requirements are ignored.

Scraping has evaded a reckoning with privacy law largely because scrapers act as if all publicly available data were free for the taking. But the public availability of scraped data shouldn’t give scrapers a free pass. Privacy law regularly protects publicly available data, and privacy principles are implicated even when personal data is accessible to others.

This Article explores the fundamental tension between scraping and privacy law. With the zealous pursuit and astronomical growth of AI, we are in the midst of what we call the “great scrape.” There must now be a great reconciliation…(More)”.

(Almost) 200 Years of News-Based Economic Sentiment


Paper by Jules H. van Binsbergen, Svetlana Bryzgalova, Mayukh Mukhopadhyay & Varun Sharma: “Using text from 200 million pages of 13,000 US local newspapers and machine learning methods, we construct a 170-year-long measure of economic sentiment at the country and state levels, that expands existing measures in both the time series (by more than a century) and the cross-section. Our measure predicts GDP (both nationally and locally), consumption, and employment growth, even after controlling for commonly-used predictors, as well as monetary policy decisions. Our measure is distinct from the information in expert forecasts and leads its consensus value. Interestingly, news coverage has become increasingly negative across all states in the past half-century…(More)”.

The Collaboverse: A Collaborative Data-Sharing and Speech Analysis Platform


Paper by Justin D. Dvorak and Frank R. Boutsen: “Collaboration in the field of speech-language pathology occurs across a variety of digital devices and can entail the usage of multiple software tools, systems, file formats, and even programming languages. Unfortunately, gaps between the laboratory, clinic, and classroom can emerge in part because of siloing of data and workflows, as well as the digital divide between users. The purpose of this tutorial is to present the Collaboverse, a web-based collaborative system that unifies these domains, and describe the application of this tool to common tasks in speech-language pathology. In addition, we demonstrate its utility in machine learning (ML) applications…

This tutorial outlines key concepts in the digital divide, data management, distributed computing, and ML. It introduces the Collaboverse workspace for researchers, clinicians, and educators in speech-language pathology who wish to improve their collaborative network and leverage advanced computation abilities. It also details an ML approach to prosodic analysis….

The Collaboverse shows promise in narrowing the digital divide and is capable of generating clinically relevant data, specifically in the area of prosody, whose computational complexity has limited widespread analysis in research and clinic alike. In addition, it includes an augmentative and alternative communication app allowing visual, nontextual communication…(More)”.

Finding, distinguishing, and understanding overlooked policy entrepreneurs


Paper by Gwen Arnold, Meghan Klasic, Changtong Wu, Madeline Schomburg & Abigail York: “Scholars have spent decades arguing that policy entrepreneurs, change agents who work individually and in groups to influence the policy process, can be crucial in introducing policy innovation and spurring policy change. How to identify policy entrepreneurs empirically has received less attention. This oversight is consequential because scholars trying to understand when policy entrepreneurs emerge, and why, and what makes them more or less successful, need to be able to identify these change agents reliably and accurately. This paper explores the ways policy entrepreneurs are currently identified and highlights issues with current approaches. We introduce a new technique for eliciting and distinguishing policy entrepreneurs, coupling automated and manual analysis of local news media and a survey of policy entrepreneur candidates. We apply this technique to the empirical case of unconventional oil and gas drilling in Pennsylvania and derive some tentative results concerning factors which increase entrepreneurial efficacy…(More)”.

Protecting Policy Space for Indigenous Data Sovereignty Under International Digital Trade Law


Paper by Andrew D. Mitchell and Theo Samlidis: “The impact of economic agreements on Indigenous peoples’ broader rights and interests has been subject to ongoing scrutiny. Technological developments and an increasing emphasis on Indigenous sovereignty within the digital domain have given rise to a global Indigenous data sovereignty movement, surfacing concerns about how international economic law impacts Indigenous peoples’ sovereignty over their data. This Article examines the policy space certain governments have reserved under international economic agreements to introduce measures for protecting Indigenous data or digital sovereignty (IDS). We argue that treaty countries have secured, under recent international digital trade chapters and agreements, the benefits of a comprehensive economic treaty and sufficient regulatory autonomy to protect Indigenous data sovereignty…(More)”.

Scaling Synthetic Data Creation with 1,000,000,000 Personas


Paper by Xin Chan, et al: “We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub — a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world’s total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub’s use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development…(More)”.

Collaborating with Journalists and AI: Leveraging Social Media Images for Enhanced Disaster Resilience and Recovery


Paper by Dhiraj Murthy et al: “Methods to meaningfully integrate journalists into crisis informatics remain lacking. We explored the feasibility of generating a real-time, priority-driven map of infrastructure damage during a natural disaster by strategically selecting journalist networks to identify sources of image-based infrastructure-damage data. Using the REST Twitter API, 1,000,522 tweets were collected from September 13-18, 2018, during and after Hurricane Florence made landfall in the United States. Tweets were classified by source (e.g., news organizations or citizen journalists), and 11,638 images were extracted. We utilized Google’s AutoML Vision software to successfully develop a machine learning image classification model to interpret this sample of images. As a result, 80% of our labeled data was used for training, 10% for validation, and 10% for testing. The model achieved an average precision of 90.6%, an average recall of 77.2%, and an F1 score of .834. In the future, establishing strategic networks of journalists ahead of disasters will reduce the time needed to identify disaster-response targets, thereby focusing relief and recovery efforts in real-time. This approach ultimately aims to save lives and mitigate harm…(More)”.
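
As a quick check on the metrics quoted above: the F1 score is the harmonic mean of precision and recall. The minimal sketch below recomputes it from the two averages in the abstract. The helper name `f1_score` is ours, not the authors', and the sketch assumes the reported F1 was derived from the averaged precision and recall rather than averaged per class.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall as reported in the abstract.
reported_precision = 0.906
reported_recall = 0.772

f1 = f1_score(reported_precision, reported_recall)
print(round(f1, 3))  # → 0.834
```

Under that assumption, the result agrees with the F1 score of .834 given in the abstract.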

Exploring Digital Biomarkers for Depression Using Mobile Technology


Paper by Yuezhou Zhang et al: “With the advent of ubiquitous sensors and mobile technologies, wearables and smartphones offer a cost-effective means for monitoring mental health conditions, particularly depression. These devices enable the continuous collection of behavioral data, providing novel insights into the daily manifestations of depressive symptoms.

We found several significant links between depression severity and various behavioral biomarkers: elevated depression levels were associated with diminished sleep quality (assessed through Fitbit metrics), reduced sociability (approximated by Bluetooth), decreased levels of physical activity (quantified by step counts and GPS data), a slower cadence of daily walking (captured by smartphone accelerometers), and disturbances in circadian rhythms (analyzed across various data streams).

Leveraging digital biomarkers for assessing and continuously monitoring depression introduces a new paradigm in early detection and development of customized intervention strategies. Findings from these studies not only enhance our comprehension of depression in real-world settings but also underscore the potential of mobile technologies in the prevention and management of mental health issues…(More)”.

Building an AI ecosystem in a small nation: lessons from Singapore’s journey to the forefront of AI


Paper by Shaleen Khanal, Hongzhou Zhang & Araz Taeihagh: “Artificial intelligence (AI) is arguably the most transformative technology of our time. While all nations would like to mobilize their resources to play an active role in AI development and utilization, only a few nations, such as the United States and China, have the resources and capacity to do so. If so, how can smaller or less resourceful countries navigate the technological terrain to emerge at the forefront of AI development? This research presents an in-depth analysis of Singapore’s journey in constructing a robust AI ecosystem amidst the prevailing global dominance of the United States and China. By examining the case of Singapore, we argue that by designing policies that address risks associated with AI development and implementation, smaller countries can create a vibrant AI ecosystem that encourages experimentation and early adoption of the technology. In addition, through Singapore’s case, we demonstrate the active role the government can play, not only as a policymaker but also as a steward to guide the rest of the economy towards the application of AI…(More)”.

The Role of Open Data in Driving Sectoral Innovation and Global Economic Development


Paper by Olalekan Jamiu Okunleye: “This study assessed the transformative impact of implementing open data principles on fostering innovation across various sectors and enhancing global economic development. Using a comprehensive analysis of secondary data from government portals, industry reports, and global innovation indexes between 2015 and 2019, the research employed panel data regression, correlation analysis, and descriptive statistics to evaluate key relationships. The findings indicate that the availability of open data significantly increases innovation outputs, with robust statistical evidence showing positive correlations between open data sets and sector-specific innovation metrics such as patents filed, R&D expenditure, and the number of startups created. Greater interoperability of open data across international borders contributes to economic growth, particularly through international joint ventures. However, the lack of standardized data formats hampers cross-sector collaboration. Regions with well-established open data policies demonstrate faster technological advancements and economic development compared to regions without such policies. The study highlighted the critical importance of promoting open data initiatives, standardizing data formats, strengthening data governance frameworks, and investing in digital infrastructure and capacity building to optimize open data utilization and drive sustainable development…(More)”.