The AI data scraping challenge: How can we proceed responsibly?


Article by Lee Tiedrich: “Society faces an urgent and complex artificial intelligence (AI) data scraping challenge.  Left unsolved, it could threaten responsible AI innovation.  Data scraping refers to using web crawlers or other means to obtain data from third-party websites or social media properties.  Today’s large language models (LLMs) depend on vast amounts of scraped data for training and potentially other purposes.  Scraped data can include facts, creative content, computer code, personal information, brands, and just about anything else.  At least some LLM operators directly scrape data from third-party sites.  Common Crawl, LAION, and other sites make scraped data readily accessible.  Meanwhile, Bright Data and others offer scraped data for a fee.
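
To make the mechanics concrete, here is a minimal, hypothetical sketch of the kind of scraping the article describes: fetching a single public page and extracting its visible text. The URL, user-agent string, and robots.txt courtesy check are illustrative assumptions rather than details from the article, and the sketch relies on the commonly used requests and BeautifulSoup libraries.

```python
# Minimal illustrative scraper (a hypothetical sketch, not the article's method).
# Fetches one public page, checks robots.txt as a courtesy, and extracts visible text.
from urllib.parse import urljoin
import urllib.robotparser

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/articles/some-page"  # placeholder URL (assumption)
USER_AGENT = "research-bot"                     # placeholder identifier (assumption)


def allowed_by_robots(url: str, user_agent: str = USER_AGENT) -> bool:
    """Check the site's robots.txt before fetching."""
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(urljoin(url, "/robots.txt"))
    robots.read()
    return robots.can_fetch(user_agent, url)


def scrape_text(url: str) -> str:
    """Download a page and return its human-readable text."""
    response = requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content tags
        tag.decompose()
    return " ".join(soup.get_text(separator=" ").split())


if __name__ == "__main__":
    if allowed_by_robots(URL):
        print(scrape_text(URL)[:500])  # preview the first 500 characters
```

At the scale of LLM training corpora, the same basic loop runs across billions of pages, which is why ready-made scraped sources such as Common Crawl matter so much.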

In addition to fueling commercial LLMs, scraped data can provide researchers with much-needed data to advance social good.  For instance, Environmental Journal explains how scraped data enhances sustainability analysis.  Nature reports that scraped data improves research about opioid-related deaths.  Training data in different languages can help make AI more accessible for users in Africa and other underserved regions.  Access to training data can even advance the OECD AI Principles by improving safety and reducing bias and other harms, particularly when such data is suitable for the AI system’s intended purpose…(More)”.

Societal challenges and big qualitative data require a new era of methodological pragmatism


Blog by Alex Gillespie, Vlad Glăveanu, and Constance de Saint-Laurent: “The ‘classic’ methods we use today in psychology and the social sciences might seem relatively fixed, but they are the product of collective responses to concerns within a historical context. The 20th century methods of questionnaires and interviews made sense in a world where researchers did not have access to what people did or said, and even if they did, could not analyse it at scale. Questionnaires and interviews were suited to 20th century concerns (shaped by colonialism, capitalism, and the ideological battles of the Cold War) for understanding, classifying, and mapping opinions and beliefs.

However, what social scientists are faced with today is different due to the culmination of two historical trends. The first has to do with the nature of the problems we face. Inequalities, the climate emergency and current wars are compounded by a general rise in nationalism, populism, and especially post-truth discourses and ideologies. Nationalism and populism are not new, but the scale and sophistication of misinformation threatens to undermine collective responses to collective problems.

It is often said that we live in the age of ‘big data’, but what is less often said is that this is in fact the age of ‘big qualitative data’.

The second trend refers to technology and its accelerated development, especially the unprecedented accumulation of naturally occurring data (digital footprints) combined with increasingly powerful methods for data analysis (traditional and generative AI). It is often said that we live in the age of ‘big data’, but what is less often said is that this is in fact the age of ‘big qualitative data’. The biggest datasets are unstructured qualitative data (each minute adds 2.5 million Google text searches, 500 thousand photos on Snapchat, 500 hours of YouTube videos) and the most significant AI advances leverage this qualitative data and make it tractable for social research.
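
As a rough illustration of what making “big qualitative data” tractable can look like in practice, the sketch below embeds a handful of free-text responses and groups them into provisional themes. The model name, the tiny example corpus, and the choice of clustering are assumptions made for illustration; they are not taken from the blog or the book.

```python
# Illustrative sketch: grouping free-text responses into themes with embeddings.
# The example texts and model choice are hypothetical, not drawn from the blog.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

texts = [
    "Worried about rising energy bills this winter",
    "Heating costs are becoming unaffordable",
    "Misinformation about vaccines spreads fast online",
    "I keep seeing false health claims on social media",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # small general-purpose text encoder
embeddings = model.encode(texts)                 # one vector per response

# Group the responses into two provisional themes.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)

for text, label in zip(texts, labels):
    print(f"theme {label}: {text}")
```

The point is not these particular tools but that unstructured text, once intractable at scale, can now be grouped, summarised, and compared computationally.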

These two trends have been fuelling the rise in mixed methods research…(More)”. (See also their new book ‘Pragmatism and Methodology’, which is open access.)

Evaluating LLMs Through a Federated, Scenario-Writing Approach


Article by Bogdana “Bobi” Rakova: “What do screenwriters, AI builders, researchers, and survivors of gender-based violence have in common? I’d argue they all imagine new, safe, compassionate, and empowering approaches to building understanding.

In partnership with Kwanele South Africa, I lead an interdisciplinary team, exploring this commonality in the context of evaluating large language models (LLMs) — more specifically, chatbots that provide legal and social assistance in a critical context. The outcomes of our engagement are a series of evaluation objectives and scenarios that contribute to an evaluation protocol with the core tenet that when we design for the most vulnerable, we create better futures for everyone. In what follows I describe our process. I hope this methodological approach and our early findings will inspire other evaluation efforts to meaningfully center the margins in building more positive futures that work for everyone…(More)”
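
The article does not reproduce the team’s protocol, but a scenario-writing evaluation can be sketched in rough outline: each scenario pairs a situation drafted by people with lived or domain expertise with the qualities an acceptable chatbot reply should show, and model responses are then checked against those qualities. Everything below (the scenario fields, the criteria, and the naive keyword scoring) is a hypothetical illustration, not the protocol developed with Kwanele South Africa.

```python
# Hypothetical sketch of a scenario-based evaluation harness.
# Fields, criteria, and scoring are illustrative, not the article's actual protocol.
from dataclasses import dataclass, field


@dataclass
class Scenario:
    description: str   # situation written by scenario authors
    prompt: str        # what a user in that situation might ask the chatbot
    criteria: list = field(default_factory=list)  # qualities a good reply should show


def evaluate_response(response: str, scenario: Scenario) -> dict:
    """Naive keyword check per criterion; a real protocol would rely on expert review."""
    return {c: c.lower() in response.lower() for c in scenario.criteria}


scenario = Scenario(
    description="A survivor seeks information about applying for a protection order.",
    prompt="How do I apply for a protection order?",
    criteria=["protection order", "confidential", "support"],
)

reply = ("You can apply for a protection order at the magistrate's court; "
         "the process is confidential and support services can assist you.")
print(evaluate_response(reply, scenario))  # e.g. {'protection order': True, ...}
```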

Why Do Universities Ignore Good Ideas?


Article by Jeffrey Funk: “Here is a recent assessment of 2023 Nobel Prize winner Katalin Karikó:

“Eight current and former colleagues of Karikó told The Daily Pennsylvanian that — over the course of three decades — the university repeatedly shunned Karikó and her research, despite its groundbreaking potential.”

Another article claims that this occurred because she could not get the financial support to continue her research.

Why couldn’t she get financial support? “You’re more likely to get grants if you’re a tenured faculty member, but you’re more likely to get promoted to tenure if you get grants,” said Eric Feigl-Ding, an epidemiologist at the New England Complex Systems Institute and a former faculty member and researcher at Harvard Medical School. “There is a vicious cycle,” he says.

Interesting. So, the idea doesn’t matter. What matters to funding agencies is that you have previously obtained funding or are a tenured professor. Really? Are funding agencies this narrow-minded?

Mr. Feigl-Ding also said, “Universities also tend to look at how much a researcher publishes, or how widely covered by the media their work is, as opposed to how innovative the research is.” But why couldn’t Karikó get published?

Science magazine tells the story of her main paper with Drew Weissman in 2005. After being rejected by Nature within 24 hours: “It was similarly rejected by Science and by Cell, and the word incremental kept cropping up in the editorial staff comments.”

Incremental? There are more than two million papers published each year, and this research, for which Karikó and Weissman won a Nobel Prize, was deemed incremental? If it had been rejected over its methods or because its findings seemed impossible to believe, I think most people could understand the rejection. But incremental?

Obviously, most of the two million papers published each year really are incremental. Yet one of the few papers that we can all agree was not incremental was rejected because it was deemed incremental.

Furthermore, this is happening in a system of science in which even Nature admits that “disruptive science has declined,” few science-based technologies are being successfully commercialized, and Nature concedes that it doesn’t understand why…(More)”.

Public sector capacity matters, but what is it?


Blog by Rainer Kattel, Mariana Mazzucato, Rosie Collington, Fernando Fernandez-Monge, Iacopo Gronchi, Ruth Puttick: “As governments turn increasingly to public sector innovations, challenges, missions and transformative policy initiatives, the need to understand and develop public sector capacities is ever more important. In IIPP’s project with Bloomberg Philanthropies to develop a Public Sector Capabilities Index, we propose to define public sector capacities through three inter-connected layers: state capacities, organisational capabilities, and the dynamic capabilities of public organisations.

The idea that governments should be able to design and deliver effective policies has existed ever since we had governments. A quick search in Google’s Ngram viewer shows that the use of state capacity in published books has experienced exponential growth since the late 1980s. It is, however, not a coincidence that focus on state and public sector capacities more broadly emerges in the shadow of new public management and neoliberal governance and policy reforms. Rather than understanding governance as a collaborative effort between all sectors, these reforms gave normative preference to business practices. Increasing focus on public sector capacity as a concept should thus be understood as an attempt to rebalance our understanding of how change happens in societies — through cross-sectoral co-creation — and as an effort to build the muscles in public organisations to work together to tackle socio-economic challenges.

We propose to define public sector capacities through three inter-connected layers: state capacities, organisational routines, and the dynamic capabilities of public organisations…(More)”.

Civic Trust: What’s In A Concept?


Article by Stefaan Verhulst, Andrew J. Zahuranec, Oscar Romero and Kim Ochilo: “We will only be able to improve civic trust once we know how to measure it…

A visualization of the ways to measure civic trust

Recently, there’s been a noticeable decline in trust toward institutions across different sectors of society. This is a serious issue, as evidenced by surveys including the Edelman Trust Barometer, Gallup, and Pew Research.

Diminishing trust presents substantial obstacles. It threatens to weaken the foundation of a pluralistic democracy, adversely affects public health, and hinders the collaboration needed to tackle worldwide challenges such as climate change. Trust forms the cornerstone of democratic social contracts and is crucial for maintaining the civic agreements essential for the prosperity and cohesion of communities, cities, and countries alike.

Yet to increase civic trust, we need to know what we mean by it and how to measure it, which turns out to be a challenging exercise. Toward that end, The GovLab at New York University and the New York Civic Engagement Commission joined forces to catalog and identify methodologies to quantify and understand the nuances of civic trust.

“Building trust across New York is essential if we want to deepen civic engagement,” said Sarah Sayeed, Chair and Executive Director of the Civic Engagement Commission. “Trust is the cornerstone of a healthy community and robust democracy.”

This blog delves into various strategies for developing metrics to measure civic trust, informed by our own desk research, which categorizes civic trust metrics into descriptive, diagnostic, and evaluative measures…(More)”.

The Importance of Using Proper Research Citations to Encourage Trustworthy News Reporting


Article by Andy Tattersall: “…Understanding the often mysterious processes of how research is picked up and used across different sections of the media is therefore important. To do this we looked at a sample of research that included at least one author from the University of Sheffield that had been cited in either national or local media. We obtained the data from Altmetric.com to explore whether the news story included supporting information that linked readers to the research and those behind it. These were links to any of the authors, their institution, the journal or the research funder. We also investigated how much of this research was available via open access.
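
A simplified version of that link-checking step might look like the sketch below: for each news story, fetch the page and test whether it links out to the paper (for example via a DOI), the journal, or the institution. The URL and the link patterns are placeholders chosen for illustration; the excerpt does not describe the study’s actual pipeline.

```python
# Illustrative link audit for news coverage of research.
# The URL and patterns are placeholders, not the study's actual data.
import requests
from bs4 import BeautifulSoup

news_urls = ["https://example-news-site.com/story-about-a-study"]  # placeholder

PATTERNS = {
    "paper (DOI)": "doi.org",
    "journal": "nature.com",
    "institution": "sheffield.ac.uk",
}

for url in news_urls:
    html = requests.get(url, timeout=10).text
    links = [a.get("href", "") for a in BeautifulSoup(html, "html.parser").find_all("a")]
    found = {label: any(pattern in href for href in links)
             for label, pattern in PATTERNS.items()}
    print(url, found)
```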

National news websites were more likely to include a link to the research paper underpinning the news story.

The contrasts between national and local samples were notable. National news websites were more likely to include a link to the research paper underpinning the news story. National research coverage stories were also more organic. They were more likely to be original texts written by journalists who are credited as authors. This is reflected in more idiosyncratic citation practices. Guardian writers, such as Henry Nicholls and George Monbiot, regularly provided a proper academic citation to the research at the end of their articles. This should be standard practice, but it does require those writing press releases to include formatted citations with a link as a basic first step. 

Local news coverage followed a different pattern, which is likely due to their use of news agencies to provide stories. Much local news coverage relies on copying and pasting subscription content provided by the UK’s national news agency, PA News. Anyone who has visited their local news website in recent years will know that they are full of pop-ups and hyperlinks to adverts and commercial websites. As a result of this business model, local news stories contain no or very few links to the research and those behind the work. Whether any of this practice and the lack of information stems from academic institution and publisher press releases is debatable. 

Much local news coverage relies on copying and pasting subscription content provided by the UK’s national news agency, PA News.

Further, we found that local coverage of research is often syndicated across multiple news sites belonging to a few publishers. Consequently, when a publisher syndicates the same information across its news platforms, it replicates bad practice. A solution to this is to include a readily formatted citation with a link, preferably to an open access version, at the foot of the story. This allows local media to continue linking to third-party sites whilst providing an option to explore the actual research paper, especially if that paper is open access…(More)”.

How Mental Health Apps Are Handling Personal Information


Article by Erika Solis: “…Before diving into the privacy policies of mental health apps, it’s necessary to distinguish between “personal information” and “sensitive information,” which are both collected by such apps. Personal information can be defined as information that is “used to distinguish or trace an individual’s identity.” Sensitive information, however, can be any data that, if lost, misused, or illegally modified, may negatively affect an individual’s privacy rights. While health information not covered by HIPAA has previously been treated as general personal information, states like Washington are implementing strong legislation that will treat a wide range of health data as sensitive, with attendant stricter guidelines.

Legislation addressing the treatment of personal information and sensitive information varies around the world. Regulations like the General Data Protection Regulation (GDPR) in the EU, for example, require all types of personal information to be treated as being of equal importance, with certain special categories, including health data, receiving slightly elevated levels of protection. Meanwhile, U.S. federal laws are limited in addressing applicable protections of information provided to a third party, so mental health app companies based in the United States can approach personal information in all sorts of ways. For instance, Mindspa, an app with chatbots that are only intended to be used when a user is experiencing an emergency, and Elomia, a mental health app that’s meant to be used at any time, don’t make distinctions between these contexts in their privacy policies. They also don’t distinguish between the potentially different levels of sensitivity associated with ordinary and crisis use.

Wysa, on the other hand, clearly indicates how it protects personal information. Making a distinction between personal and sensitive data, its privacy policy notes that all health-based information receives additional protection. Similarly, Limbic labels everything as personal information but notes that data, including health, genetic, and biometric, fall within a “special category” that requires more explicit consent than other personal information collected to be used…(More)”.

Unlocking Technology for Peacebuilding: The Munich Security Conference’s Role in Empowering a Peacetech Movement


Article by Stefaan Verhulst and Artur Kluz: “This week’s annual Munich Security Conference is taking place amid a turbulent backdrop. The so-called “peace dividend” that followed the end of the Cold War has long since faded. From Ukraine to Sudan to the Middle East, we are living in an era marked by increasingly unstable geopolitics and renewed–and new forms of–violent conflict. Recently, the Uppsala Conflict Data Program, measuring war since 1945, identified 2023 as the worst on record since the Cold War. As the Foreword to the Munich Security Report, issued alongside the Conference, notes: “Unfortunately, this year’s report reflects a downward trend in world politics, marked by an increase in geopolitical tensions and economic uncertainty.”

As we enter deeper into this violent era, it is worth considering the role of technology. It is perhaps no coincidence that a moment of growing peril and division coincides with the increasing penetration of technologies such as smartphones and social media, or with the emergence of new technologies such as artificial intelligence (AI) and virtual reality. In addition, the actions of satellite operators and cross-border digital payment networks have been thrust into the limelight, with their roles in enabling or precipitating conflict attracting increasing scrutiny. Today, it appears increasingly clear that transnational tech actors–and technology itself–are playing a more significant role in geopolitical conflict than ever before. As the Munich Security Report notes, “Technology has gone from being a driver of global prosperity to being a central means of geopolitical competition.”

It doesn’t have to be this way. While much attention is paid to technology’s negative capabilities, this article argues that technology can also play a more positive role, through the contributions of what is sometimes referred to as Peacetech. Peacetech is an emerging field, encompassing technologies as varied as early warning systems, AI-driven predictions, and citizen journalism platforms. Broadly, its aims can be described as preventing conflict, mediating disputes, mitigating human suffering, and protecting human dignity and universal human rights. In the words of the United Nations Institute for Disarmament Research (UNIDIR), “Peacetech aims to leverage technology to drive peace while also developing strategies to prevent technology from being used to enable violence.”

This article is intended as a call to those attending the Munich Security Conference to prioritize Peacetech — at a global geopolitical forum for peacebuilding. Highlighting recent concerns over the role of technology in conflict–with a particular emphasis on the destructive potential of AI and satellite systems–we argue for technology’s positive potential instead, by promoting peace and mitigating conflict. In particular, we suggest the need for a realignment in how policy and other stakeholders approach and fund technology, to foster its peaceful rather than destructive potential. This realignment would bring out the best in technology; it would harness technology toward the greater public good at a time of rising geopolitical uncertainty and instability…(More)”.

The U.S. Census Is Wrong on Purpose


Blog by David Friedman: “This is a story about data manipulation. But it begins in a small Nebraska town called Monowi that has only one resident, 90-year-old Elsie Eiler.

The sign reading “Monowi 1,” as seen on Google Street View.

There used to be more people in Monowi. But little by little, the other residents of Monowi left or died. That’s what happened to Elsie’s own family — her children grew up and moved out and her husband passed away in 2004, leaving her as the sole resident. Now she votes for herself for Mayor, and pays herself taxes. Her husband Rudy’s old book collection became the town library, with Elsie as librarian.

But despite what you might imagine, Elsie is far from lonely. She runs a tavern that’s been in her family for 50 years, and has plenty of regulars from the town next door who come by every day to dine and chat.

I first read about Elsie more than 10 years ago. At the time, it wasn’t as well-known a story, but Elsie has since gotten a lot of coverage and become a bit of a minor celebrity. Now and then I still come across a new article, including a lovely photo essay in the New York Times and a short video on the BBC Travel site.

A Google search reveals many, many similar articles that all tell more or less the same story.

But then suddenly in 2021, there was a new wrinkle: According to the just-published 2020 U.S. Census data, Monowi now had 2 residents, doubling its population.

This came as a surprise to Elsie, who told a local newspaper, “Then someone’s been hiding from me, and there’s nowhere to live but my house.”

It turns out that nobody new had actually moved to Monowi without Elsie realizing. And the census bureau didn’t make a mistake. They intentionally changed the census data, adding one resident.

Why would they do that? Well, it turns out the census bureau sometimes moves residents around on paper in order to protect people’s privacy.

Full census data is only made available 72 years after the census takes place, in accordance with the creatively named “72-year rule.” Until then, it is only available as aggregated data with individual identifiers removed. Still, if the population of a town is small enough, and census data for that town indicates, for example, that there is just one 90-year-old woman and she lives alone, someone could conceivably figure out who that individual is.

So the census bureau sometimes moves people around to create noise in the data that makes that sort of identification a little bit harder…(More)”.
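
Conceptually, these disclosure-avoidance techniques range from swapping records between places to, in the 2020 census, injecting formally calibrated statistical noise into published counts. The toy sketch below shows the noise idea with made-up counts and parameters; it is not the Census Bureau’s actual algorithm.

```python
# Toy illustration of noise injection for small-area counts.
# Counts and epsilon are made up; this is not the Census Bureau's
# actual disclosure-avoidance system.
import numpy as np

rng = np.random.default_rng(seed=42)

true_counts = {"One-person village": 1, "Neighboring village": 287}
epsilon = 0.5  # smaller epsilon -> more noise -> stronger privacy

noisy_counts = {
    place: max(0, int(round(count + rng.laplace(loc=0.0, scale=1.0 / epsilon))))
    for place, count in true_counts.items()
}

print(noisy_counts)  # a tiny town's published count may differ from its true count
```

A published figure like Monowi’s “2 residents” is the visible side effect of exactly this kind of protection: the smaller the place, the more a little added noise or a single swapped record changes the headline number.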