How to Regulate AI? Start With the Data


Article by Susan Ariel Aaronson: “We live in an era of data dichotomy. On one hand, AI developers rely on large data sets to ‘train’ their systems about the world and respond to user questions. These data troves have become increasingly valuable and visible. On the other hand, despite the importance of data, U.S. policy makers don’t view data governance as a vehicle to regulate AI.

U.S. policy makers should reconsider that perspective. The European Union and more than 30 other countries, for example, provide their citizens with a right not to be subject to automated decision-making without explicit consent. Data governance is clearly an effective way to regulate AI.

Many AI developers treat data as an afterthought, but how AI firms collect and use data can tell you a lot about the quality of the AI services they produce. Firms and researchers struggle to collect, classify, and label data sets that are large enough to reflect the real world, but then don’t adequately clean (remove anomalies or problematic data) and check their data. Also, few AI developers and deployers divulge information about the data they use to train AI systems. As a result, we don’t know if the data that underlies many prominent AI systems is complete, consistent, or accurate. We also don’t know where that data comes from (its provenance). Without such information, users don’t know if they should trust the results they obtain from AI. 
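To make the “clean and check” step concrete, here is a minimal sketch of the kind of deduplication and completeness filter the author has in mind; the record fields (`text`, `source_url`) are illustrative assumptions, not details from the article:

```python
def clean(records):
    """Drop records that are exact duplicates or are missing required fields."""
    seen, kept = set(), []
    for rec in records:
        key = (rec.get("text"), rec.get("source_url"))
        if None in key or key in seen:
            continue  # incomplete or duplicate: exclude from the training set
        seen.add(key)
        kept.append(rec)
    return kept

raw = [
    {"text": "a cat photo", "source_url": "u1"},
    {"text": "a cat photo", "source_url": "u1"},  # exact duplicate
    {"text": None, "source_url": "u2"},           # missing required field
    {"text": "a dog photo", "source_url": "u3"},
]
cleaned = clean(raw)  # keeps only the two complete, distinct records
```

Even a filter this simple surfaces the questions the article raises: what counts as “required,” and who checks that the survivors reflect the real world.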

The Washington Post set out to document this problem. It collaborated with the Allen Institute for AI to examine Google’s C4 data set, a large corpus widely used to train language models that was built from data scraped by bots from 15 million websites. Google then filters the data, but it understandably can’t filter the entire data set.

Hence, this data set provides sufficient training data, but it also presents major risks for those firms or researchers who rely on it. Web scraping is generally legal in most countries as long as the scraped data isn’t used to cause harm to society, a firm, or an individual. But the Post found that the data set contained swaths of data from sites that sell pirated or counterfeit data, which the Federal Trade Commission views as harmful. Moreover, to be legal, the scraped data should not include personal data obtained without user consent or proprietary data obtained without firm permission. Yet the Post found large amounts of personal data in the data sets as well as some 200 million instances of copyrighted data denoted with the copyright symbol.
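The Post’s counts suggest how such an audit can be automated. Below is a hedged sketch using simple regular expressions to flag copyright notices and email addresses (a stand-in for personal data); a real audit of a C4-scale corpus would need far more robust detectors than these illustrative patterns:

```python
import re

# Illustrative patterns only: the copyright symbol or "(c) YYYY",
# and a loose email shape as a proxy for personal data.
COPYRIGHT = re.compile(r"©|\(c\)\s*\d{4}", re.IGNORECASE)
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def audit(documents):
    """Count documents containing a copyright notice or an email address."""
    flags = {"copyright": 0, "email": 0}
    for doc in documents:
        if COPYRIGHT.search(doc):
            flags["copyright"] += 1
        if EMAIL.search(doc):
            flags["email"] += 1
    return flags

sample = [
    "© 2021 Example Corp. All rights reserved.",
    "Contact the author at jane.doe@example.com for details.",
    "An unremarkable page with nothing to flag.",
]
report = audit(sample)  # one copyright hit, one email hit
```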

Reliance on scraped data sets presents other risks. Without careful examination of the data sets, the firms relying on that data and their clients cannot know if it contains incomplete or inaccurate data, which in turn could lead to problems of bias, propaganda, and misinformation. But researchers cannot check data accuracy without information about data provenance. Consequently, the firms that rely on such unverified data are creating some of the AI risks regulators hope to avoid. 

It makes sense for Congress to start with data as it seeks to govern AI. There are several steps Congress could take…(More)”.

Destination? Care Blocks!


Blog by Natalia González Alarcón, Hannah Chafetz, Diana Rodríguez Franco, Uma Kalkar, Bapu Vaitla, & Stefaan G. Verhulst: “Time poverty” caused by an overload of unpaid care work, such as washing, cleaning, cooking, and looking after care-receivers, is a structural consequence of gender inequality. In the City of Bogotá, 1.2 million women — 30% of the city’s female population — carry out unpaid care work full-time. If such work were compensated, it would represent 13% of Bogotá’s GDP and 20% of the country’s GDP. Moreover, the care burden falls disproportionately on women’s shoulders and prevents them from furthering their education, achieving financial autonomy, participating in their community, and tending to their personal wellbeing.

To address the care burden and its spillover consequences on women’s economic autonomy, well-being and political participation, in October 2020, Bogotá Mayor Claudia López launched the Care Block Initiative. Care Blocks, or Manzanas del cuidado, are centralized areas for women’s economic, social, medical, educational, and personal well-being and advancement. They provide services simultaneously for caregivers and care-receivers.

As the program expands from 19 existing Care Blocks to 45 Care Blocks by the end of 2035, decision-makers face another issue: mobility is a critical and often limiting factor for women when accessing Care Blocks in Bogotá.

On May 19th, 2023, The GovLab, Data2X, and the Secretariat for Women’s Affairs in the City Government of Bogotá co-hosted a studio that aimed to scope a purposeful and gender-conscious data collaborative to address mobility-related issues affecting access to Care Blocks in Bogotá. Convening experts from across the gender, mobility, policy, and data ecosystems, the studio focused on (1) prioritizing the critical questions as they relate to mobility and access to Care Blocks and (2) identifying the data sources and actors that could be tapped to set up a new data collaborative…(More)”.

Artificial Intelligence, Big Data, Algorithmic Management, and Labor Law


Chapter by Pauline Kim: “Employers are increasingly relying on algorithms and AI to manage their workforces, using automated systems to recruit, screen, select, supervise, discipline, and even terminate employees. This chapter explores the effects of these systems on the rights of workers in standard work relationships, who are presumptively protected by labor laws. It examines how these new technological tools affect fundamental worker interests and how existing law applies, focusing primarily on two concerns as examples—nondiscrimination and privacy. Although current law provides some protections, legal doctrine has largely developed with human managers in mind and, as a result, fails to fully apprehend the risks posed by algorithmic tools. Thus, while anti-discrimination law prohibits discrimination by workplace algorithms, the existing framework has a number of gaps and uncertainties when applied to these systems. Similarly, traditional protections for employee privacy are ill-equipped to address the sheer volume and granularity of worker data that can now be collected, and the ability of computational techniques to extract new insights and infer sensitive information from that data. More generally, the expansion of algorithmic management affects other fundamental worker interests because it tends to increase employer power vis-à-vis labor. This chapter concludes by briefly considering the role that data protection laws might play in addressing the risks of algorithmic management…(More)”.

How Leaders in Higher Education Can Embed Behavioral Science in Their Institutions


Essay by Ross E. O’Hara: “…Once we view student success through a behavioral science lens and see the complex systems underlying student decision making, it becomes clear that behavioral scientists work best not as mechanics who repair broken systems, but as engineers who design better systems. Higher education, therefore, needs to diffuse those engineers throughout the organization.

To that end, Hallsworth recommends that organizations change their view of behavioral science “from projects to processes, from commissions to culture.” Only when behavioral science expertise is diffused across units and incorporated into all key organizational functions can a college become behaviorally enabled. So how might higher education go about this transformation?

1. Leverage the faculty

Leaders with deep expertise in behavioral science are likely already employed in social and behavioral sciences departments. Consider ways to focus their energy inward to tackle institutional challenges, perhaps using their own classrooms or departments as testing grounds. As they find promising solutions, build the infrastructure to disseminate and implement those ideas college- and system-wide. Unlike higher education’s normal approach—giving faculty additional unpaid and underappreciated committee work—provide funding and recognition that incentivize faculty to make higher education policy an important piece of their academic portfolio.

2. Practice cross-functional training

I have spent the past several years providing colleges with behavioral science professional development, but too often this work is focused on a single functional unit, like academic advisors or faculty. Instead, create trainings that include representatives from across campus (e.g., enrollment; financial aid; registrar; student affairs). Not only will this diffuse behavioral science knowledge across the institution, but it will bring together the key players that impact student experience and make it easier for them to see the adaptive system that determines whether a student graduates or withdraws.

3. Let behavioral scientists be engineers

Whether you look for faculty or outside consultants, bring behavioral science experts into conversations early. From redesigning college-to-career pathways to building a new cafeteria, behavioral scientists can help gather and interpret student voices, foresee and circumvent behavioral challenges, and identify measurable and meaningful evaluation metrics. The impact of their expertise will be even greater when they work in an environment with a diffuse knowledge of behavioral science already in place…(More)”

Gamifying medical data labeling to advance AI


Article by Zach Winn: “…Duhaime began exploring ways to leverage collective intelligence to improve medical diagnoses. In one experiment, he trained groups of lay people and medical school students that he describes as “semiexperts” to classify skin conditions, finding that by combining the opinions of the highest performers he could outperform professional dermatologists. He also found that by combining algorithms trained to detect skin cancer with the opinions of experts, he could outperform either method on its own….The DiagnosUs app, which Duhaime developed with Centaur co-founders Zach Rausnitz and Tom Gellatly, is designed to help users test and improve their skills. Duhaime says about half of users are medical school students and the other half are mostly doctors, nurses, and other medical professionals…
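Duhaime’s exact aggregation method isn’t described in the article; a minimal sketch of the general idea, a majority vote restricted to the raters with the best track records, might look like the following (the rater names and accuracy scores are invented for illustration):

```python
from collections import Counter

def crowd_label(opinions, scores, top_k=3):
    """Majority vote among the top_k raters with the best track records.

    opinions: {rater: label for one case}; scores: {rater: historical accuracy}.
    """
    ranked = sorted(opinions, key=lambda r: scores.get(r, 0.0), reverse=True)
    votes = Counter(opinions[r] for r in ranked[:top_k])
    return votes.most_common(1)[0][0]

# Invented example: four raters label one skin-lesion image.
opinions = {"ana": "melanoma", "ben": "benign", "cruz": "melanoma", "dee": "benign"}
scores = {"ana": 0.92, "ben": 0.55, "cruz": 0.81, "dee": 0.40}
label = crowd_label(opinions, scores, top_k=3)
```

Restricting the vote to proven performers is what lets a crowd of “semiexperts” rival, and sometimes beat, individual specialists.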

The approach stands in sharp contrast to traditional data labeling and AI content moderation, which are typically outsourced to low-resource countries.

Centaur’s approach produces accurate results, too. In a paper with researchers from Brigham and Women’s Hospital, Massachusetts General Hospital (MGH), and Eindhoven University of Technology, Centaur showed its crowdsourced opinions labeled lung ultrasounds as reliably as experts did…

Centaur has found that the best performers come from surprising places. In 2021, to collect expert opinions on EEG patterns, researchers held a contest through the DiagnosUs app at a conference featuring about 50 epileptologists, each with more than 10 years of experience. The organizers made a custom shirt to give to the contest’s winner, who they assumed would be in attendance at the conference.

But when the results came in, a pair of medical students in Ghana, Jeffery Danquah and Andrews Gyabaah, had beaten everyone in attendance. The highest-ranked conference attendee had come in ninth…(More)”

“How Democracy Should Work” Lesson in Learning, Building Cohesion and Community


Case study by Marjan Horst Ehsassi: “Something special happened in a small community just north of San Francisco during the summer of 2022. The city of Petaluma decided to do democracy a bit differently. To figure out what to do about a seemingly-intractable local issue, the city of 60,000 decided policymakers and “experts” shouldn’t be the only ones at the decision-making table—residents of Petaluma also ought to have a voice. They would do this by instituting a Citizens’ Assembly—the first of its kind in California.

Citizens’ Assemblies and sortition are not new ideas; in fact, they’ve helped citizens engage in decision-making since Ancient Greece. Yet only recently did they resurge as a possible antidote to a representative democracy that no longer reflects citizens’ preferences and to pervasive citizen disengagement from political institutions. Also referred to as lottery-selected panels or citizens’ panels, this deliberative platform has gained popularity in Western Europe but is only just beginning to make inroads in the United States. The Petaluma City Council’s decision to invite Healthy Democracy (healthydemocracy.org), a leading U.S. organization dedicated to designing and implementing deliberative democracy programs, to convene a citizens’ assembly on the future of a large plot of public land demonstrates unique political vision and will. This decision contributes to a roadmap for innovative ways to engage with citizens.

This case study examines this novel moment of democratic experimentation in California, which became known as the Petaluma Fairgrounds Advisory Panel (PFAP). It begins with a description of the context; a summary of the PFAP’s design, composition, and process; and a discussion of the role of the government lead, or sponsor, the Petaluma City Council. An analysis of the impact of participation on the panelists, using a methodology developed by the author in several other case studies, follows. Finally, the last section provides several recommendations to enhance the impact of such processes, as well as thoughts on the future of deliberative platforms…(More)”.

How data helped Mexico City reduce high-impact crime by more than 50%


Article by Alfredo Molina Ledesma: “When Claudia Sheinbaum Pardo became Mayor of Mexico City in 2018, she wanted a new approach to tackling the city’s most pressing problems. Crime was at the very top of the agenda – only 7% of the city’s inhabitants considered it a safe place. New policies were needed to turn this around.

Data became a central part of the city’s new strategy. The Digital Agency for Public Innovation was created in 2019 – tasked with using data to help transform the city. To put this into action, the city administration immediately implemented an open data policy and launched their official data platform, Portal de Datos Abiertos. The policy and platform aimed to make data that Mexico City collects accessible to anyone: municipal agencies, businesses, academics, and ordinary people.

“The main objective of the open data strategy of Mexico City is to enable more people to make use of the data generated by the government in a simple and interactive manner,” said Jose Merino, Head of the Digital Agency for Public Innovation. “In other words, what we aim for is to democratize the access and use of information.” To achieve this goal a new tool for interactive data visualization called Sistema Ajolote was developed in open source and integrated into the Open Data Portal…

Information that had never been made public before, such as street-level crime from the Attorney General’s Office, is now accessible to everyone. Academics, businesses and civil society organizations can access the data to create solutions and innovations that complement the city’s new policies. One example is the successful “Hoyo de Crimen” app, which proposes safe travel routes based on the latest street-level crime data, enabling people to avoid crime hotspots as they walk or cycle through the city.
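The article doesn’t describe how “Hoyo de Crimen” computes its routes; one plausible sketch is a shortest-path search in which each street segment’s length is inflated by its crime risk, so the cheapest route trades a little distance for a lot of safety (the toy street graph below is invented for illustration):

```python
import heapq

def safest_route(graph, start, goal):
    """Dijkstra over edges weighted by distance * (1 + crime risk).

    graph: {node: [(neighbor, distance, risk), ...]}, with risk in [0, 1].
    """
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == goal:
            break
        if cost > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, d, risk in graph.get(node, []):
            new_cost = cost + d * (1.0 + risk)  # risk inflates effective length
            if new_cost < dist.get(nbr, float("inf")):
                dist[nbr] = new_cost
                prev[nbr] = node
                heapq.heappush(heap, (new_cost, nbr))
    path, node = [], goal
    while node != start:
        path.append(node)
        node = prev[node]
    path.append(start)
    return list(reversed(path))

# Toy grid: the direct leg through B crosses a crime hotspot (risk 0.9).
graph = {
    "A": [("B", 1.0, 0.9), ("C", 1.0, 0.1)],
    "B": [("D", 1.0, 0.1)],
    "C": [("D", 1.0, 0.1)],
}
route = safest_route(graph, "A", "D")  # detours through C to avoid the hotspot
```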

Since the introduction of the open data policy – which has contributed to a comprehensive crime reduction and social support strategy – high-impact crime in the city has decreased by 53%, and 43% of Mexico City residents now consider the city to be a safe place…(More)”.

Use of AI in social sciences could mean humans will no longer be needed in data collection


Article by Michael Lee: “A team of researchers from four Canadian and American universities says artificial intelligence could replace humans when it comes to collecting data for social science research.

Researchers from the University of Waterloo, University of Toronto, Yale University and the University of Pennsylvania published an article in the journal Science on June 15 about how AI, specifically large language models (LLMs), could affect their work.

“AI models can represent a vast array of human experiences and perspectives, possibly giving them a higher degree of freedom to generate diverse responses than conventional human participant methods, which can help to reduce generalizability concerns in research,” Igor Grossmann, professor of psychology at Waterloo and a co-author of the article, said in a news release.

Philip Tetlock, a psychology professor at UPenn and article co-author, goes so far as to say that LLMs will “revolutionize human-based forecasting” in just three years.

In their article, the authors pose the question: “How can social science research practices be adapted, even reinvented, to harness the power of foundational AI? And how can this be done while ensuring transparent and replicable research?”

The authors say the social sciences have traditionally relied on methods such as questionnaires and observational studies.

But with the ability of LLMs to pore over vast amounts of text data and generate human-like responses, the authors say this presents a “novel” opportunity for researchers to test theories about human behaviour at a faster rate and on a much larger scale.

Scientists could use LLMs to test theories in a simulated environment before applying them in the real world, the article says, or gather differing perspectives on a complex policy issue and generate potential solutions.

“It won’t make sense for humans unassisted by AIs to venture probabilistic judgments in serious policy debates. I put a 90 per cent chance on that,” Tetlock said. “Of course, how humans react to all of that is another matter.”

One issue the authors identified, however, is that LLMs often learn to exclude sociocultural biases, raising the question of whether models are correctly reflecting the populations they study…(More)”

Better Government Tech Is Possible


Article by Beth Noveck: “In the first four months of the Covid-19 pandemic, government leaders paid management consultants at McKinsey $100 million to model the spread of the coronavirus and build online dashboards to project hospital capacity.

It’s unsurprising that leaders turned to McKinsey for help, given the notorious backwardness of government technology. Our everyday experience with online shopping and search only highlights the stark contrast between user-friendly interfaces and the frustrating inefficiencies of government websites—or worse yet, the ongoing need to visit a government office to submit forms in person. The 2016 animated movie Zootopia depicts literal sloths running the DMV, a scene that was guaranteed to get laughs given our low expectations of government responsiveness.

More seriously, these doubts are reflected in the plummeting levels of public trust in government. From early Healthcare.gov failures to the more recent implosions of state unemployment websites, policymaking without attention to the technology that puts the policy into practice has led to disastrous consequences.

The root of the problem is that the government, the largest employer in the US, does not keep its employees up-to-date on the latest tools and technologies. When I served in the Obama White House as the nation’s first deputy chief technology officer, I had to learn constitutional basics and watch annual training videos on sexual harassment and cybersecurity. But I was never required to take a course on how to use technology to serve citizens and solve problems. In fact, the last significant legislation about what public professionals need to know was the Government Employee Training Act, from 1958, well before the internet was invented.

In the United States, public sector awareness of how to use data or human-centered design is very low. Of the 400-plus public servants surveyed in 2020, fewer than 25 percent had received training in these more tech-enabled ways of working, though 70 percent said they wanted such training…(More)”.

Fighting poverty with synthetic data


Article by Jack Gisby, Anna Kiknadze, Thomas Mitterling, and Isabell Roitner-Fransecky: “If you have ever used a smartwatch or other wearable tech to track your steps, heart rate, or sleep, you are part of the “quantified self” movement. You are voluntarily submitting millions of intimate data points for collection and analysis. The Economist highlighted the benefits of good quality personal health and wellness data—increased physical activity, more efficient healthcare, and constant monitoring of chronic conditions. However, not everyone is enthusiastic about this trend. Many fear corporations will use the data to discriminate against the poor and vulnerable. For example, insurance firms could exclude patients based on preconditions obtained from personal data sharing.

Can we strike a balance between protecting the privacy of individuals and gathering valuable information? This blog explores applying a synthetic populations approach in New York City, a city with an established reputation for using big data approaches to support urban management, including for welfare provisions and targeted policy interventions.

To better understand poverty rates at the census tract level, World Data Lab, with the support of the Sloan Foundation, generated a synthetic population for the borough of Brooklyn. Synthetic populations rely on a combination of microdata and summary statistics:

  • Microdata consists of personal information at the individual level. In the U.S., such data is available at the Public Use Microdata Area (PUMA) level. PUMAs are geographic areas that partition a state, each containing no fewer than 100,000 people. However, due to privacy concerns, microdata is unavailable at the more granular census tract level. Microdata includes both household- and individual-level information, such as last year’s household income, household size, the number of rooms, and the age, sex, and educational attainment of each individual living in the household.
  • Summary statistics are based on populations rather than individuals and are available at the census tract level, given that there are fewer privacy concerns. Census tracts are small statistical subdivisions of a county, averaging about 4,000 inhabitants. In New York City, a census tract roughly equals a building block. Similar to microdata, summary statistics are available for individuals and households. On the census tract level, we know the total population, the corresponding demographic breakdown, the number of households within different income brackets, the number of households by number of rooms, and other similar variables…(More)”.
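World Data Lab’s actual methodology isn’t detailed here; as a simplified illustration of how the two data sources combine, a synthetic tract population can be drawn by resampling PUMA-level microdata households until the tract’s published counts per income bracket are met (production methods typically use more sophisticated techniques such as iterative proportional fitting; the bracket labels and household records below are invented):

```python
import random

def synthesize_tract(microdata, bracket_counts, seed=0):
    """Resample PUMA-level households to match tract-level income-bracket counts."""
    rng = random.Random(seed)  # seeded for reproducibility
    pools = {}
    for hh in microdata:
        pools.setdefault(hh["income_bracket"], []).append(hh)
    synthetic = []
    for bracket, n in bracket_counts.items():
        pool = pools.get(bracket)
        if not pool:
            continue  # no microdata donors for this bracket
        synthetic.extend(rng.choice(pool) for _ in range(n))
    return synthetic

# Invented PUMA microdata and tract-level summary counts.
puma_microdata = [
    {"income_bracket": "under_50k", "rooms": 3, "size": 2},
    {"income_bracket": "under_50k", "rooms": 4, "size": 4},
    {"income_bracket": "over_50k", "rooms": 5, "size": 3},
]
tract = synthesize_tract(puma_microdata, {"under_50k": 2, "over_50k": 1})
```

The result is a tract-sized roster of plausible households that matches the public summary statistics without exposing any real individual’s record.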