Blog by PUBLIC: “Across the monitoring and evaluation (M&E) lifecycle, we are already seeing real, deployable applications of AI tools.
- AI-assisted evidence synthesis is probably the most mature area. Tools can now search, screen, and summarise bodies of literature at a scale that would take human teams weeks. For evaluation teams scoping a new programme area, or exploring what an adjacent field can tell them about their topic, this is genuinely useful today.
A recent example is InsightAgent, a multi-agent framework designed for complex systematic reviews. Researchers demonstrated that the tool could partition a massive body of literature, read and synthesise the findings, and draft a rigorous review in just 1.5 hours – a process that traditionally takes months to complete manually. Researchers could also visually monitor the AI’s reading trajectory, adjust its inclusion criteria, and verify its sources in real time.
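To make this concrete, here is a minimal sketch of the screening step that such tools automate, with the inclusion criteria kept as plain text so a human can inspect and adjust them mid-run. The `Record` fields, the criteria string, and the `ask_llm` callable are illustrative assumptions, not InsightAgent’s actual interface:

```python
# A minimal sketch of LLM-assisted title/abstract screening.
# The record fields, criteria, and `ask_llm` callable are
# illustrative assumptions, not InsightAgent's API.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Record:
    title: str
    abstract: str


def screen(records: list[Record],
           criteria: str,
           ask_llm: Callable[[str], str]) -> list[tuple[Record, bool]]:
    """Apply explicit inclusion criteria to each record, keeping the
    criteria visible so a researcher can audit and revise them."""
    decisions = []
    for r in records:
        prompt = (
            f"Inclusion criteria:\n{criteria}\n\n"
            f"Title: {r.title}\nAbstract: {r.abstract}\n\n"
            "Answer INCLUDE or EXCLUDE, then one sentence of justification."
        )
        answer = ask_llm(prompt)
        decisions.append((r, answer.strip().upper().startswith("INCLUDE")))
    return decisions
```

Keeping the criteria as an explicit, editable input is what makes the human-in-the-loop adjustment described above possible.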
- AI-led qualitative interviews – including voice – have been shown to generate substantially richer responses than conventional open-text fields. For public sector evaluations, the possibility of running qualitative research at a fraction of the cost is a meaningful shift. These methods are also effective where multiple layers of governance apply – for example, in evaluation framework development and in qualitative evaluation of ‘unmonetisable’ outcomes, as per the Green Book.
For example, PUBLIC recently used Salomo to conduct user research for a major public sector project. Gathering and synthesising user research at this scale would traditionally take a team of multiple researchers many months. Using Salomo’s agentic capabilities, however, a team of just two researchers was able to process, code, and extract insights from 100 interviews in less than a week.
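The coding step at the heart of a workflow like this can be sketched as follows. The codebook, the JSON output format, and the `ask_llm` callable are illustrative assumptions; Salomo’s actual interface is not shown here:

```python
# A minimal sketch of LLM-assisted qualitative coding of an interview
# transcript. The codebook and `ask_llm` callable are illustrative
# assumptions, not Salomo's API.
import json
from typing import Callable

CODEBOOK = {
    "barrier": "An obstacle the participant faced using the service",
    "enabler": "Something that made the service easier to use",
    "outcome": "A change the participant attributes to the programme",
}


def code_transcript(transcript: str,
                    ask_llm: Callable[[str], str]) -> list[dict]:
    """Tag excerpts with codes from a fixed codebook, returning
    quote-level evidence that a researcher can audit against the source."""
    prompt = (
        "Apply this qualitative codebook to the interview transcript. "
        'Return a JSON list of {"quote": ..., "codes": [...]} objects, '
        "quoting the participant verbatim.\n\n"
        f"Codebook: {json.dumps(CODEBOOK)}\n\nTranscript:\n{transcript}"
    )
    # A production pipeline would validate this output before use.
    return json.loads(ask_llm(prompt))
```

Running this over 100 transcripts and aggregating the code counts is what collapses months of manual coding into days, while the verbatim quotes preserve an audit trail.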
- Getting to concrete outputs and models more quickly. Analysis and reporting workflows are starting to allow evaluators to go from a research question to a documented, reproducible output – with code, findings, and visualisations – in a fraction of the time previously required.
For example, AI Scientist-V2 is a system capable of automating the scientific research lifecycle. Given a high-level prompt, the agent autonomously formulates hypotheses, writes and debugs experiment code, visualises data, and drafts a complete manuscript in under 15 hours. It also recently produced a research paper that passed double-blind peer review.
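A minimal sketch of the kind of reproducible artefact these workflows produce – one script that runs the model, saves the chart, and writes up the findings – might look like this. The data file, variable names, and model specification are hypothetical:

```python
# A minimal sketch of a reproducible analysis output: code, a
# visualisation, and documented findings from one script. The dataset
# and variables are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf
import matplotlib.pyplot as plt

df = pd.read_csv("programme_data.csv")  # hypothetical evaluation dataset

# Run the econometric model (a simple OLS specification for illustration).
model = smf.ols("outcome ~ treated + baseline", data=df).fit()

# Generate the chart.
fig, ax = plt.subplots()
ax.scatter(df["baseline"], df["outcome"], c=df["treated"])
ax.set_xlabel("baseline")
ax.set_ylabel("outcome")
fig.savefig("figure_1.png")

# Write the documented findings alongside the full model output.
with open("findings.md", "w") as f:
    f.write("# Findings\n\n```\n" + model.summary().as_text() + "\n```\n")
```

Because the script, figure, and write-up are produced together, the output is reproducible end to end: anyone can re-run it against the same data and get the same annex.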
While public sector policy evaluation has its own complexities and stakeholder dynamics, the implication is clear: these are tools that can handle the heavy mechanical execution – running the econometrics, generating charts, and drafting technical annexes – freeing evaluators to focus on the harder interpretive questions and policy implications…(More)”.