The Limits of Data


Essay by C. Thi Nguyen: “…Right now, the language of policymaking is data. (I’m talking about “data” here as a concept, not as particular measurements.) Government agencies, corporations, and other policymakers all want to make decisions based on clear data about positive outcomes. They want to succeed on the metrics—to succeed in clear, objective, and publicly comprehensible terms. But metrics and data are incomplete by their basic nature. Every data collection method is constrained and every dataset is filtered.

Some very important things don’t make their way into the data. It’s easier to justify health care decisions in terms of measurable outcomes: increased average longevity or increased numbers of lives saved in emergency room visits, for example. But there are so many important factors that are far harder to measure: happiness, community, tradition, beauty, comfort, and all the oddities that go into “quality of life.”

Consider, for example, a policy proposal that doctors should urge patients to sharply lower their saturated fat intake. This should lead to better health outcomes, at least for those that are easier to measure: heart attack numbers and average longevity. But the focus on easy-to-measure outcomes often diminishes the salience of other downstream consequences: the loss of culinary traditions, disconnection from a culinary heritage, and a reduction in daily culinary joy. It’s easy to dismiss such things as “intangibles.” But actually, what’s more tangible than a good cheese, or a cheerful fondue party with friends?…(More)”.

Automakers Are Sharing Consumers’ Driving Behavior With Insurance Companies


Article by Kashmir Hill: “Kenn Dahl says he has always been a careful driver. The owner of a software company near Seattle, he drives a leased Chevrolet Bolt. He’s never been responsible for an accident.

So Mr. Dahl, 65, was surprised in 2022 when the cost of his car insurance jumped by 21 percent. Quotes from other insurance companies were also high. One insurance agent told him his LexisNexis report was a factor.

LexisNexis is a New York-based global data broker with a “Risk Solutions” division that caters to the auto insurance industry and has traditionally kept tabs on car accidents and tickets. Upon Mr. Dahl’s request, LexisNexis sent him a 258-page “consumer disclosure report,” which it must provide per the Fair Credit Reporting Act.

What it contained stunned him: more than 130 pages detailing each time he or his wife had driven the Bolt over the previous six months. It included the dates of 640 trips, their start and end times, the distance driven and an accounting of any speeding, hard braking or sharp accelerations. The only thing it didn’t have was where they had driven the car.

On a Thursday morning in June, for example, the car had been driven 7.33 miles in 18 minutes; there had been two rapid accelerations and two incidents of hard braking.

According to the report, the trip details had been provided by General Motors — the manufacturer of the Chevy Bolt. LexisNexis analyzed that driving data to create a risk score “for insurers to use as one factor of many to create more personalized insurance coverage,” according to a LexisNexis spokesman, Dean Carney. Eight insurance companies had requested information about Mr. Dahl from LexisNexis over the previous month.

“It felt like a betrayal,” Mr. Dahl said. “They’re taking information that I didn’t realize was going to be shared and screwing with our insurance.”…(More)”.
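
How such telemetry becomes a risk score is proprietary, but the basic mechanics are easy to sketch. The toy example below aggregates per-trip events into an events-per-100-miles figure; everything about it is invented for illustration and does not reflect LexisNexis’s or GM’s actual methodology.

```python
# Purely illustrative: how per-trip telemetry *might* be distilled into a
# single number. The normalization and scale are invented; nothing here
# reflects LexisNexis's or GM's actual models.
from dataclasses import dataclass

@dataclass
class Trip:
    miles: float
    minutes: float
    hard_brakes: int
    rapid_accels: int
    speeding_events: int

def events_per_100_miles(trips: list[Trip]) -> float:
    """Normalize event counts by distance so long trips aren't over-penalized."""
    total_miles = sum(t.miles for t in trips)
    total_events = sum(t.hard_brakes + t.rapid_accels + t.speeding_events
                       for t in trips)
    return 100.0 * total_events / total_miles if total_miles else 0.0

# The trip from the article: 7.33 miles in 18 minutes, with two rapid
# accelerations and two hard-braking incidents.
june_trip = Trip(miles=7.33, minutes=18, hard_brakes=2,
                 rapid_accels=2, speeding_events=0)
print(f"{events_per_100_miles([june_trip]):.1f} flagged events per 100 miles")
```

Even this crude version shows why a handful of short trips can swing a distance-normalized metric sharply, one reason insurers say they use such scores only as “one factor of many.”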

A Plan to Develop Open Science’s Green Shoots into a Thriving Garden


Article by Greg Tananbaum, Chelle Gentemann, Kamran Naim, and Christopher Steven Marcum: “…As it’s moved from an abstract set of principles about access to research and data into the realm of real-world activities, the open science movement has mirrored some of the characteristics of the open source movement: distributed, independent, with loosely coordinated actions happening in different places at different levels. Globally, many things are happening, often disconnected, but still interrelated: open science has sown a constellation of thriving green shoots, not quite yet a garden, but all growing rapidly on arable soil.

It is now time to consider how much faster and farther the open science movement could go with more coordination. What efficiencies might be realized if disparate efforts could better harmonize across geographies, disciplines, and sectors? How would an intentional, systems-level approach to aligning incentives, infrastructure, training, and other key components of a rationally functioning research ecosystem advance the wider goals of the movement? Streamlining research processes, reducing duplication of efforts, and accelerating scientific discoveries could ensure that the fruits of open science processes and products are more accessible and equitably distributed…(More)”.

Synthetic Data and the Future of AI


Paper by Peter Lee: “The future of artificial intelligence (AI) is synthetic. Several of the most prominent technical and legal challenges of AI derive from the need to amass huge amounts of real-world data to train machine learning (ML) models. Collecting such real-world data can be highly difficult and can threaten privacy, introduce bias in automated decision making, and infringe copyrights on a massive scale. This Article explores the emergence of a seemingly paradoxical technical creation that can mitigate—though not completely eliminate—these concerns: synthetic data. Increasingly, data scientists are using simulated driving environments, fabricated medical records, fake images, and other forms of synthetic data to train ML models. Artificial data, in other words, is being used to train artificial intelligence. Synthetic data offers a host of technical and legal benefits; it promises to radically decrease the cost of obtaining data, sidestep privacy issues, reduce automated discrimination, and avoid copyright infringement. Alongside such promise, however, synthetic data offers perils as well. Deficiencies in the development and deployment of synthetic data can exacerbate the dangers of AI and cause significant social harm.

In light of the enormous value and importance of synthetic data, this Article sketches the contours of an innovation ecosystem to promote its robust and responsible development. It identifies three objectives that should guide legal and policy measures shaping the creation of synthetic data: provisioning, disclosure, and democratization. Ideally, such an ecosystem should incentivize the generation of high-quality synthetic data, encourage disclosure of both synthetic data and processes for generating it, and promote multiple sources of innovation. This Article then examines a suite of “innovation mechanisms” that can advance these objectives, ranging from open source production to proprietary approaches based on patents, trade secrets, and copyrights. Throughout, it suggests policy and doctrinal reforms to enhance innovation, transparency, and democratic access to synthetic data. Just as AI will have enormous legal implications, law and policy can play a central role in shaping the future of AI…(More)”.
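
As a concrete toy example of the technique Lee describes: the sketch below fits independent Gaussians to each column of a made-up set of patient records, then samples statistically similar fake rows. Production-grade generators (GANs, copulas, diffusion models) are far more sophisticated and also preserve cross-column correlations, which this deliberately simple version does not.

```python
# A minimal sketch of one synthetic-data technique: fit per-column
# Gaussians to a real dataset and sample new rows. The records are
# fabricated for illustration; real generators are far more capable.
import numpy as np

rng = np.random.default_rng(seed=42)

# Stand-in for sensitive real records: (age, systolic_bp, cholesterol).
real = np.array([
    [34, 118, 180.0],
    [51, 130, 210.0],
    [46, 125, 195.0],
    [62, 140, 230.0],
])

mu, sigma = real.mean(axis=0), real.std(axis=0)
synthetic = rng.normal(mu, sigma, size=(1000, real.shape[1]))

# The synthetic rows preserve column-level statistics but correspond
# to no actual patient -- the core privacy appeal the paper discusses.
print("synthetic means:", synthetic.mean(axis=0).round(1))
print("real means:     ", real.mean(axis=0).round(1))
```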

Prompting Diverse Ideas: Increasing AI Idea Variance


Paper by Lennart Meincke, Ethan Mollick, and Christian Terwiesch: “Unlike routine tasks where consistency is prized, in creativity and innovation the goal is to create a diverse set of ideas. This paper delves into the burgeoning interest in employing Artificial Intelligence (AI) to enhance the productivity and quality of the idea generation process. While previous studies have found that the average quality of AI ideas is quite high, prior research has also pointed to the inability of AI-based brainstorming to create sufficient dispersion of ideas, which limits novelty and the quality of the overall best idea. Our research investigates methods to increase the dispersion in AI-generated ideas. Using GPT-4, we explore the effect of different prompting methods on Cosine Similarity, the number of unique ideas, and the speed with which the idea space gets exhausted. We do this in the domain of developing a new product for college students, priced under $50. In this context, we find that (1) pools of ideas generated by GPT-4 with various plausible prompts are less diverse than ideas generated by groups of human subjects; (2) the diversity of AI-generated ideas can be substantially improved using prompt engineering; and (3) Chain-of-Thought (CoT) prompting leads to the highest diversity of ideas of all prompts we evaluated and comes close to what is achieved by groups of human subjects. It was also capable of generating the highest number of unique ideas of any prompt we studied…(More)”.
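
For readers curious what measuring “dispersion” looks like in practice, here is a minimal sketch: embed each generated idea and average the pairwise cosine similarities (lower means more diverse). TF-IDF stands in for whatever embedding model the authors used, and the ideas are invented examples.

```python
# A rough sketch of the diversity metric the paper describes: mean
# pairwise cosine similarity across a pool of ideas. TF-IDF is a
# stand-in embedding; the ideas below are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ideas = [
    "A collapsible laundry hamper that clips onto a dorm bed frame",
    "A foldable laundry basket that attaches to dorm furniture",
    "A subscription box of microwave-safe single-serve meal kits",
    "A whiteboard desk surface for quick study notes",
]

vectors = TfidfVectorizer().fit_transform(ideas)
sims = cosine_similarity(vectors)

# Average over off-diagonal pairs only (the diagonal is always 1.0).
n = len(ideas)
mean_pairwise = (sims.sum() - n) / (n * (n - 1))
print(f"mean pairwise cosine similarity: {mean_pairwise:.3f}")
```

Note how the first two ideas are near-duplicates: a brainstorm full of such pairs scores high on similarity, which is exactly the failure mode the paper's prompt engineering tries to reduce.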

All the News That’s Fit to Click: How Metrics Are Transforming the Work of Journalists


Book by Caitlin Petre: “Journalists today are inundated with data about which stories attract the most clicks, likes, comments, and shares. These metrics influence what stories are written, how news is promoted, and even which journalists get hired and fired. Do metrics make journalists more accountable to the public? Or are these data tools the contemporary equivalent of a stopwatch wielded by a factory boss, worsening newsroom working conditions and journalism quality? In All the News That’s Fit to Click, Caitlin Petre takes readers behind the scenes at the New York Times, Gawker, and the prominent news analytics company Chartbeat to explore how performance metrics are transforming the work of journalism.

Petre describes how digital metrics are a powerful but insidious new form of managerial surveillance and discipline. Real-time analytics tools are designed to win the trust and loyalty of wary journalists by mimicking key features of addictive games, including immersive displays, instant feedback, and constantly updated “scores” and rankings. Many journalists get hooked on metrics—and pressure themselves to work ever harder to boost their numbers.

Yet this is not a simple story of managerial domination. Contrary to the typical perception of metrics as inevitably disempowering, Petre shows how some journalists leverage metrics to their advantage, using them to advocate for their professional worth and autonomy…(More)”.

Trust in AI companies drops to 35 percent in new study


Article by Filip Timotija: “Trust in artificial intelligence (AI) companies has dipped to 35 percent over a five-year period in the U.S., according to new data.

The data, released Tuesday by public relations firm Edelman, found that trust in AI companies also dropped globally by eight points, going from 61 percent to 53 percent. 

The dwindling confidence in the rapidly developing tech industry comes as regulators in the U.S. and across the globe are brainstorming solutions on how to regulate the sector. 

When broken down by political party, researchers found Democrats showed the most trust in AI companies at 38 percent — compared to Republicans’ 24 percent and independents’ 25 percent, per the study.

Multiple factors contributed to the decline in trust toward the companies polled in the data, according to Justin Westcott, Edelman’s chair of global technology.

“Key among these are fears related to privacy invasion, the potential for AI to devalue human contributions, and apprehensions about unregulated technological leaps outpacing ethical considerations,” Westcott said, adding “the data points to a perceived lack of transparency and accountability in how AI companies operate and engage with societal impacts.”

Technology as a whole is losing its lead in trust among sectors, Edelman said, highlighting the key findings from the study.

“Eight years ago, technology was the leading industry in trust in 90 percent of the countries we study,” researchers wrote, referring to the 28 countries. “Now it is most trusted only in half.”

Westcott argued the findings should be a “wake-up call” for AI companies to “build back credibility through ethical innovation, genuine community engagement and partnerships that place people and their concerns at the heart of AI developments.”

As for the impacts on the future for the industry as a whole, “societal acceptance of the technology is now at a crossroads,” he said, adding that trust in AI and the companies producing it should be seen “not just as a challenge, but an opportunity.”

Priorities, Westcott continued, should revolve around ethical practices, transparency and a “relentless focus” on the benefits to society AI can provide…(More)”.

Unconventional data, unprecedented insights: leveraging non-traditional data during a pandemic


Paper by Kaylin Bolt et al.: “The COVID-19 pandemic prompted new interest in non-traditional data sources to inform response efforts and mitigate knowledge gaps. While non-traditional data offers some advantages over traditional data, it also raises concerns related to biases, representativity, informed consent and security vulnerabilities. This study focuses on three specific types of non-traditional data: mobility, social media, and participatory surveillance platform data. Qualitative results are presented on the successes, challenges, and recommendations of key informants who used these non-traditional data sources during the COVID-19 pandemic in Spain and Italy….

Non-traditional data proved valuable in providing rapid results and filling data gaps, especially when traditional data faced delays. Increased data access and innovative collaborative efforts across sectors facilitated its use. Challenges included unreliable access and data quality concerns, particularly the lack of comprehensive demographic and geographic information. To further leverage non-traditional data, participants recommended prioritizing data governance, establishing data brokers, and sustaining multi-institutional collaborations. The value of non-traditional data was perceived as underutilized in public health surveillance, program evaluation and policymaking. Participants saw opportunities to integrate them into public health systems with the necessary investments in data pipelines, infrastructure, and technical capacity…(More)”.

The AI data scraping challenge: How can we proceed responsibly?


Article by Lee Tiedrich: “Society faces an urgent and complex artificial intelligence (AI) data scraping challenge.  Left unsolved, it could threaten responsible AI innovation.  Data scraping refers to using web crawlers or other means to obtain data from third-party websites or social media properties.  Today’s large language models (LLMs) depend on vast amounts of scraped data for training and potentially other purposes.  Scraped data can include facts, creative content, computer code, personal information, brands, and just about anything else.  At least some LLM operators directly scrape data from third-party sites.  Common Crawl, LAION, and other sites make scraped data readily accessible.  Meanwhile, Bright Data and others offer scraped data for a fee. 

In addition to fueling commercial LLMs, scraped data can provide researchers with much-needed data to advance social good.  For instance, Environmental Journal explains how scraped data enhances sustainability analysis.  Nature reports that scraped data improves research about opioid-related deaths.  Training data in different languages can help make AI more accessible for users in Africa and other underserved regions.  Access to training data can even advance the OECD AI Principles by improving safety and reducing bias and other harms, particularly when such data is suitable for the AI system’s intended purpose…(More)”.
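
One small, uncontroversial piece of “proceeding responsibly” can be shown in code: consulting a site’s robots.txt before crawling. The sketch below uses Python’s standard-library robotparser; the URL and user-agent string are placeholders, and real compliance also involves rate limiting, terms of service, and the licensing questions the article raises.

```python
# A minimal sketch of one responsible-scraping step: check robots.txt
# before fetching a page. The site URL and user agent are placeholders.
import urllib.request
from urllib import robotparser

USER_AGENT = "example-research-bot"  # hypothetical crawler name
target = "https://example.com/some/page.html"

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()  # fetches and parses the site's robots.txt

if rp.can_fetch(USER_AGENT, target):
    req = urllib.request.Request(target, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        html = resp.read()
    print(f"fetched {len(html)} bytes")
else:
    print("robots.txt disallows fetching this page")
```

A robots.txt check is necessary but far from sufficient; as the article notes, much of the current debate concerns data that is technically reachable yet encumbered by copyright, privacy law, or contract.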

The Computable City: Histories, Technologies, Stories, Predictions


Book by Michael Batty: “At every stage in the history of computers and communications, it is safe to say we have been unable to predict what happens next. When computers first appeared nearly seventy-five years ago, primitive computer models were used to help understand and plan cities, but as computers became faster, smaller, more powerful, and ever more ubiquitous, cities themselves began to embrace them. As a result, the smart city emerged. In The Computable City, Michael Batty investigates the circularity of this peculiar evolution: how computers and communications changed the very nature of our city models, which, in turn, are used to simulate systems composed of those same computers.

Batty first charts the origins of computers and examines how our computational urban models have developed and how they have been enriched by computer graphics. He then explores the sequence of digital revolutions and how they are converging, focusing on continual changes in new technologies, as well as the twenty-first-century surge in social media, platform economies, and the planning of the smart city. He concludes by revisiting the digital transformation as it continues to confound us, with the understanding that the city, now a high-frequency twenty-four-hour version of itself, changes our understanding of what is possible…(More)”.