
Stefaan Verhulst

Blog by Divya Siddarth: “Evaluations are quietly shaping AI. Results can move billions in investment decisions, set regulation, and influence public trust. Yet most evals tell us little about how AI systems perform in and impact the real world. At CIP we are exploring ways that collective input (public, domain expert, and regional) can help solve this. Rough thoughts below.

1. Evaluation needs to be highly context specific, which is hard. Labs have built challenging benchmarks for reasoning and generalization (ARC-AGI, GPQA, etc.), but most still focus on decontextualized problems. What they miss is how models perform in situated use: sustaining multi-hour therapy conversations, tutoring children around the world across languages, mediating policy, and shaping political discourse in real time. These contexts redefine what ‘good performance’ means.

2. Technical details can swing results. Prompt phrasing, temperature settings, even enumeration style can cause substantial performance variations. Major investment and governance decisions are being made based on measurements that are especially sensitive to implementation details. We’ve previously written about some of these challenges and ways to address them.
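One way to see this sensitivity concretely: the same correct answer can be scored right or wrong depending on whether the harness's answer-extraction logic matches the prompt's enumeration style. The parser below is a deliberately naive, hypothetical sketch, not any lab's actual harness:

```python
import re

def extract_choice(output: str, style: str):
    """Pull the selected option out of free-form model text.
    'letter' style expects A-D; 'number' style expects 1-4."""
    pattern = r"\b([A-D])\b" if style == "letter" else r"\b([1-4])\b"
    m = re.search(pattern, output)
    return m.group(1) if m else None

# Same underlying answer, phrased two ways by the model:
print(extract_choice("The answer is (B).", "letter"))   # B
print(extract_choice("I'd go with option 2.", "number"))  # 2
# A mismatch between enumeration style and parser silently scores zero:
print(extract_choice("The answer is (B).", "number"))   # None
```

A benchmark score can thus shift by points from the enumeration style alone, with no change in the model.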

3. Fruitful comparison is almost impossible. Model cards list hundreds of evaluations, but without standardized documentation in the form of prompts, parameters, and procedures, it’s scientifically questionable to compare across models. We can’t distinguish genuine differences from evaluation artifacts.
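A minimal sketch of what such standardized documentation could look like, as a machine-readable record attached to every reported score (the field names are illustrative, not an established schema):

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class EvalRecord:
    """Minimal documentation needed to make an eval score comparable."""
    model: str
    benchmark: str
    prompt_template: str   # exact template, so phrasing effects are auditable
    temperature: float
    n_samples: int
    score: float

record = EvalRecord("model-x", "GPQA", "Q: {question}\nA:", 0.0, 1, 0.41)
print(json.dumps(asdict(record)))
```

With records like this published alongside model cards, a reader could at least check whether two scores were produced under comparable conditions before comparing them.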

4. Evals are fragmented and no single entity is positioned to solve this. Labs run proprietary internal evals, and academic efforts are often static and buried in research papers and GitHub repos. No single actor can build evals for every possible context and domain worldwide. Third-party evaluations only measure what they’re hired to measure. Academic benchmarks often become outdated. In practice, we can think of evals in three categories:

  • Capability evals (reasoning, coding, math), which measure raw problem-solving.
  • Risk evals (jailbreaks, alignment, misuse), which probe safety and misuse potential.
  • Contextual evals (domain- or culture-specific), which test performance in particular settings…(More)”.

Notes on building collective intelligence into evals

World Bank Report: “The transformative potential of artificial intelligence (AI) in public governance is increasingly recognized across both developed and developing economies. Governments are exploring and adopting AI technologies to enhance service delivery, streamline administrative efficiency, and strengthen data-driven decision-making. However, the integration of AI into public systems also introduces ethical, technical, and institutional challenges – ranging from algorithmic bias and lack of transparency to data privacy concerns and regulatory fragmentation. These challenges are especially salient in public sector contexts, where trust, accountability, and equity are crucial. This paper addresses a central question: How can public institutions adopt AI responsibly while safeguarding privacy, promoting fairness, and ensuring accountability? In particular, it focuses on the readiness of government agencies to implement AI technologies in a trustworthy and responsible manner. This paper responds to this need by providing both conceptual grounding and practical tools to support implementation. First, it synthesizes key ethical considerations and international frameworks that underpin trustworthy AI governance. Second, it introduces relevant technical solutions, including explainability models, privacy-enhancing technologies, and algorithmic fairness approaches, that can mitigate emerging risks in AI deployment. Third, it presents a self-assessment toolkit for public institutions: a decision flowchart for AI application and a data privacy readiness checklist. These tools are designed to help public sector actors evaluate their preparedness, identify institutional gaps, and inform internal coordination processes prior to AI adoption. By bridging theory and practice, this paper contributes to ongoing global efforts to build trustworthy AI that is lawful, ethical, inclusive, and institutionally grounded…(More)”.

Building Trustworthy Artificial Intelligence: Frameworks, Applications, and Self-Assessment for Readiness

Paper by Deininger, Klaus et al: “This paper explores whether satellite imagery can be used to derive a measure to estimate conflict-induced damage to agricultural production and compare the results to those obtained using media-based conflict indicators, which are widely used in the literature. The paper combines cropped area for summer and winter crops from annual crop maps for 2019–24 with measures of conflict-related damage to agricultural land based on optical and thermal satellite sensors. These data are used to estimate a difference-in-differences model for close to 10,000 Ukrainian village councils. The results point to large and persistent negative effects that spill over to conflict-unaffected village councils. Using the preferred image-based indicator, the predicted impact is three times larger than with a media-based indicator, with a distinctly different distribution across key domains (for example, territory controlled by Ukraine and the Russian Federation). Satellite imagery thus allows defining conflict incidence in ways that may be relevant to agricultural production and that may have implications for future research…(More)”.
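The difference-in-differences logic underlying the estimate can be sketched with a two-period comparison (the numbers below are invented for illustration, not taken from the paper):

```python
# Hypothetical mean cropped area (ha) per village council, pre/post conflict.
treated_pre, treated_post = 120.0, 84.0   # conflict-affected councils
control_pre, control_post = 118.0, 112.0  # unaffected councils

# Difference-in-differences: the change in treated councils minus the
# change in control councils nets out common shocks (weather, prices).
did = (treated_post - treated_pre) - (control_post - control_pre)
print(did)  # -30.0 ha under these illustrative numbers
```

The paper's version runs this comparison as a regression across close to 10,000 councils, which also allows for the spillover effects it documents.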

Using Remotely Sensed Data to Assess War-Induced Damage to Agricultural Cultivation: Evidence from Ukraine

Press Release and blog by Mykhailo Fedorov: “Ukraine is betting on artificial intelligence — and this is not just a trend. It is our clear and defined mission: by 2030, we aim to become one of the world’s top three countries in terms of AI development and integration in the public sector.

This week, we took another major step toward that goal — we launched Diia.AI on the Diia portal. It is the world’s first national AI-agent that goes beyond answering questions — it actually provides government services directly within a chat. The AI assistant is now available in open beta, and users can already receive the first service through AI — an income certificate. New services will be rolled out gradually as the AI develops.

Our focus is to transform Diia from a digital services platform into a fully functional AI-agent that operates 24/7, without the need to manually fill out forms or fields. Diia is becoming a proactive assistant in the citizen–state relationship. The AI-agent doesn’t simply act as a chatbot that responds to queries — it takes action based on the user’s request. For example, you write to the assistant in the chat: “I need an income certificate”, and receive it directly in your personal account on the Diia portal, with an email notification once it’s ready.
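Conceptually, such an agent maps a free-form request onto a service action rather than just a reply. A deliberately minimal sketch of that intent-to-service dispatch (purely hypothetical, not the actual Diia implementation):

```python
def handle_request(message: str) -> str:
    """Map a free-form citizen request to a service action (toy version)."""
    services = {
        "income certificate": "Ordered income certificate; it will appear in your account.",
    }
    for intent, response in services.items():
        if intent in message.lower():
            return response  # in a real agent: trigger the service workflow
    return "Sorry, I can't handle that request yet."

print(handle_request("I need an income certificate"))
```

A production agent would of course replace the keyword lookup with an LLM-based intent classifier and authenticated service calls; the point is the shift from answering to acting.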

AI agents represent the cutting edge of artificial intelligence, fundamentally changing the way services are accessed globally. The future lies with agentic states, and Ukraine is boldly advancing toward this format — where a single user request leads directly to results. AI agents act as personal digital assistants, independently building action plans, initiating service requests, and autonomously executing all stages of task completion…(More)”.

Diia.AI: The World’s First National AI-Agent That Delivers Real Government Services

Article by Amer Sinha and Ryan McKenna: “As AI becomes more integrated into our lives, building it with privacy at its core is a critical frontier for the field. Differential privacy (DP) offers a mathematically sound solution by adding calibrated noise to prevent memorization. However, applying DP to LLMs introduces trade-offs. Understanding these trade-offs is crucial. Applying DP noise alters traditional scaling laws — rules describing performance dynamics — by reducing training stability (the model’s ability to learn consistently without experiencing catastrophic events like loss spikes or divergence) and significantly increasing batch size (a collection of training examples sent to the model simultaneously for processing) and computation costs.
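The “calibrated noise” at the heart of DP can be sketched as per-example clipping plus Gaussian noise, which is the core of the DP-SGD aggregation step. This is a simplified scalar version for illustration; real implementations clip per-example gradient vectors and track a privacy budget:

```python
import random

def dp_mean(values, clip_norm, noise_multiplier, rng):
    """Differentially private mean: clip each contribution to clip_norm
    (bounding any one example's influence), then add Gaussian noise
    scaled to that sensitivity bound."""
    clipped = [v * min(1.0, clip_norm / (abs(v) + 1e-12)) for v in values]
    noise = rng.gauss(0.0, noise_multiplier * clip_norm)
    return (sum(clipped) + noise) / len(values)

rng = random.Random(0)
print(dp_mean([0.5, 2.0, -3.0], clip_norm=1.0, noise_multiplier=1.0, rng=rng))
```

The trade-offs the article describes fall out of this directly: more noise means more privacy but noisier gradients, which is why DP training favors very large batches (averaging shrinks the relative noise) at higher compute cost.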

Our new research, “Scaling Laws for Differentially Private Language Models”, conducted in partnership with Google DeepMind, establishes laws that accurately model these intricacies, providing a complete picture of the compute-privacy-utility trade-offs. Guided by this research, we’re excited to introduce VaultGemma, the largest (1B parameters) open model trained from scratch with differential privacy. We are releasing the weights on Hugging Face and Kaggle, alongside a technical report, to advance the development of the next generation of private AI…

Armed with our new scaling laws and advanced training algorithms, we built VaultGemma, to date the largest (1B parameters) open model fully pre-trained with differential privacy, using an approach that can yield high-utility models…(More)”.

VaultGemma: The world’s most capable differentially private LLM

Report by Seiling, LK et al: “As digital platforms play an increasingly prominent role in societies around the globe, calls from policymakers, civil society, and the public for transparency, accountability and evidence-based regulation of these digital services have become louder and more urgent. Independent research seeking to provide such empirical evidence has either taken place in a legal gray zone, running the risk of legal retaliation, or depended on close collaboration with platforms. The Digital Services Act (DSA), adopted in 2022 and in force since 2024, promises to change this dynamic by clearly outlining under which conditions platforms must grant data access to researchers. The recently adopted Delegated Act on data access (DA) provided more detail on the implementation of this new right to data access for researchers.
This paper provides an overview of researchers’ initial practical experience with access to publicly available data based on Art. 40(12) DSA as well as an in-depth description of the procedure for access as set out in Art. 40(4) DSA, thereby comprehensively characterising the data access options outlined in the DSA and DA. We outline key provisions and their underlying rationales to provide an overview of the goals, procedures and limits of DSA-based data access, as well as an account of external factors likely to bear on its realisation. The goal is to offer a valuable point of reference for the European as well as global community of researchers considering applications under the DSA, as well as other stakeholders aiming to understand or support the development of robust data access frameworks…(More)”.

Data Access for Researchers under the Digital Services Act: From Policy to Practice

Article by John Gautam: “In systems of social change, we grapple with an enduring tension: connection versus abstraction. Connection is slow, human, and relational. It thrives on trust, listening, and collaboration. Abstraction, on the other hand, simplifies complexity into patterns, insights, and models. It is fast, scalable, and efficient.

Both serve a purpose, but they pull in opposite directions. And now, with the rise of AI tools like large language models (LLMs), this tension has reached new heights. LLMs thrive on abstraction; they reduce human interaction into data points, surface patterns, and generate outputs.

While LLMs are not intelligent in the sense of reasoning or self-awareness, they can serve as tools that reframe, rephrase, and reorganise a person’s ideas in ways that feel expressive. This can enable creativity and reflection, but let’s be clear: It’s not agency. The tool reshapes inputs but does not make meaning…(More)”.

The limits of AI in social change

Paper by Heather Openshaw: “Governments are collecting and storing vast amounts of data. The majority of data collected globally is considered dark data: unused, unanalyzed, and largely unstructured data that drains budgets while its potential as a resource goes untapped. Dark data can exist across all forms of data collection, from IoT device logs and sensor metadata to historical paper archives and unlabeled multimedia files. Every government department is affected: legal, health, finance, intelligence, and so on.

While high-income countries are beginning to invest in tools, such as artificial intelligence (AI), and governance frameworks to surface and use this data, low- and middle-income countries (LMICs) often lack the institutional infrastructure, technical capacity, and legal safeguards needed to do the same…(More)”.

Bringing Light to Government Dark Data in the Age of AI

Article by Enrique Segura: “Smart cities are no longer just about sensors and data. Today, artificial intelligence is helping cities worldwide improve urban life for their citizens in innovative ways while saving money and delivering faster, more efficient services. Whether through government WhatsApp chatbots, graffiti detection or urban tree health monitoring, AI is reshaping the way cities work.

City chatbot with AI

Back in 2019, the Buenos Aires city government launched Boti, a WhatsApp chatbot originally designed to share COVID-19 updates. Since then, Boti has evolved into a citywide digital assistant. It now processes images sent by users (such as license plates for parking violations), alerts citizens to events in real time, and allows residents to report crimes directly from WhatsApp. With its conversational tone, Boti is designed for locals but also supports English, making it useful for visitors as well.

Its success demonstrates how AI-powered communication tools can strengthen trust and streamline services in urban environments.

AI for graffiti detection

Meanwhile, cities like Lisbon and Tempe, Arizona, are piloting AI-powered vision models to detect and map graffiti. By analyzing real-time data from cameras mounted on vehicles or drones, these systems can spot new graffiti as it appears, geo-tag affected areas and help city teams respond more quickly. This means city workers no longer have to rely solely on citizen reports; instead, they can prioritize areas based on data-driven insights.

This proactive use of AI not only saves time and resources but also contributes to cleaner and safer cities.

AI for urban tree health

Tokyo is leveraging AI to monitor and protect its urban trees through the Plant Doctor system, developed by Waseda University and Ryukoku University. Using advanced computer vision powered by YOLOv8, DeepSORT and DeepLabV3+, the system analyzes images of street trees to detect signs of disease or pest damage.

Mounted on drones or vehicles, Plant Doctor tracks the health of individual leaves and enables proactive care. This ensures healthier urban forests, reduces costly maintenance and enhances the quality of public places in the city. 
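A core building block of DeepSORT-style tracking, as used here, is associating each new detection with the existing track whose bounding box overlaps it most, measured by intersection-over-union (IoU). The sketch below is a simplified stand-in for illustration, not the Plant Doctor code (which also uses appearance features and motion prediction):

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def match_track(detection, tracks, threshold=0.3):
    """Assign a detection to the best-overlapping existing track, if any."""
    best = max(tracks, key=lambda tid: iou(detection, tracks[tid]), default=None)
    if best is not None and iou(detection, tracks[best]) >= threshold:
        return best
    return None  # no good match: start a new track

tracks = {1: (0, 0, 10, 10), 2: (50, 50, 60, 60)}
print(match_track((1, 1, 11, 11), tracks))  # overlaps track 1
```

Persistent track IDs are what let the system follow the health of an individual tree (or leaf) across repeated drive-bys rather than treating each frame independently.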

Smarter cities, better services for citizens

In New York City, an AI company uses crowdsourced dashcam imagery for crosswalk inspections, enabling its model to analyze the conditions of individual paint lines and track conditions over time. Shanghai and Singapore have developed digital twins, allowing them to model the impact of urban planning efforts, such as new construction or mobility improvements. From the minute to the massive, AI is proving to be a powerful ally in city management…(More)”.

Civic intelligence: How AI is powering smarter cities

Paper by Jorrit de Jong et al: “Over the last decades, scholars and practitioners have focused their attention on the use of data for improving public action, with a renewed interest in the emergence of big data and artificial intelligence. The potential of data is particularly salient in cities, where vast amounts of data are being generated from traditional and novel sources. Despite this growing interest, there is a need for a conceptual and operational understanding of the beneficial uses of data. This article presents a comprehensive and precise account of how cities can use data to address problems more effectively, efficiently, equitably, and in a more accountable manner. It does so by synthesizing and augmenting current research with empirical evidence derived from original research and learnings from a program designed to strengthen city governments’ data capacity. The framework can be used to support longitudinal and comparative analyses as well as explore questions such as how different uses of data employed at various levels of maturity can yield disparate outcomes. Practitioners can use the framework to identify and prioritize areas in which building data capacity might further the goals of their teams and organizations…(More)”.

The Data-Informed City: A Conceptual Framework for Advancing Research and Practice
