
Notes on building collective intelligence into evals

Blog by Divya Siddarth: “Evaluations are quietly shaping AI. Results can move billions in investment decisions, set regulation, and influence public trust. Yet most evals tell us little about how AI systems perform in and impact the real world. At CIP we are exploring ways that collective input (public, domain expert, and regional) can help solve this. Rough thoughts below.

1. Evaluation needs to be highly context-specific, which is hard. Labs have built challenging benchmarks for reasoning and generalization (ARC-AGI, GPQA, etc.), but most still focus on decontextualized problems. What they miss is how models perform in situated use: sustaining multi-hour therapy conversations, tutoring children around the world across languages, mediating policy, and shaping political discourse in real time. These contexts redefine what ‘good performance’ means.

2. Technical details can swing results. Prompt phrasing, temperature settings, even enumeration style can cause substantial performance variations. Major investment and governance decisions are being made based on measurements that are especially sensitive to implementation details. We’ve previously written about some of these challenges and ways to address them.

3. Fruitful comparison is almost impossible. Model cards list hundreds of evaluations, but without standardized documentation in the form of prompts, parameters, and procedures, it’s scientifically questionable to compare across models. We can’t distinguish genuine differences from evaluation artifacts.

4. Evals are fragmented, and no single entity is positioned to solve this. Labs run proprietary internal evals, and academic efforts are often static and buried in research papers and GitHub repos. Neither can build evals for every possible context and domain worldwide. Third-party evaluations only measure what they’re hired to measure. Academic benchmarks often become outdated. In practice, we can think of evals in three categories:

  • Capability evals (reasoning, coding, math), which measure raw problem-solving.
  • Risk evals (jailbreaks, alignment, misuse), which probe safety and misuse potential.
  • Contextual evals (domain- or culture-specific), which test performance in particular settings…(More)”.
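Points 2 and 3 of the excerpt are, at heart, an argument for reproducible evaluation records. As a rough illustration (ours, not drawn from the post), the Python sketch below shows the kind of metadata that would need to travel with every reported score for cross-model comparison to be meaningful; all names (EvalRecord, enumeration_style) and all numbers are illustrative assumptions, not real results.

```python
# A minimal sketch of a standardized eval record: every reported score is
# stored alongside the prompt template, sampling parameters, and scoring
# procedure that produced it, so genuine differences between models can be
# separated from evaluation artifacts. Names and numbers are illustrative.
import json
from dataclasses import dataclass, asdict, field

@dataclass
class EvalRecord:
    benchmark: str            # e.g. "GPQA"
    model: str                # identifier of the model being evaluated
    prompt_template: str      # exact template, since phrasing swings results
    enumeration_style: str    # "A/B/C/D" vs. "1/2/3/4" can shift scores
    temperature: float        # sampling temperature used for every item
    scoring_procedure: str    # e.g. "exact match", "majority vote of 5"
    n_items: int
    accuracy: float
    notes: dict = field(default_factory=dict)

# Two runs of the "same" benchmark that differ only in implementation details
# (entirely made-up counts and scores, for illustration only).
run_a = EvalRecord("GPQA", "model-x",
                   "Question: {q}\nChoices:\n{choices}\nAnswer:",
                   "A/B/C/D", 0.0, "exact match", 500, 0.51)
run_b = EvalRecord("GPQA", "model-x",
                   "{q}\n{choices}\nThe answer is",
                   "1/2/3/4", 1.0, "majority vote of 5", 500, 0.56)

# With the full setup recorded, the gap between the two runs is visibly an
# artifact of prompt and sampling choices rather than a capability difference.
for run in (run_a, run_b):
    print(json.dumps(asdict(run), indent=2))
```

This is only one possible shape for such a record, but without something like it attached to published numbers, the comparisons described in point 3 remain guesswork.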
