Inspecting Algorithms for Bias

Matthias Spielkamp at MIT Technology Review: “It was a striking story. ‘Machine Bias,’ the headline read, and the teaser proclaimed: ‘There’s software used across the country to predict future criminals. And it’s biased against blacks.’

ProPublica, a Pulitzer Prize–winning nonprofit news organization, had analyzed risk assessment software known as COMPAS. It is being used to forecast which criminals are most likely to reoffend. Guided by such forecasts, judges in courtrooms throughout the United States make decisions about the future of defendants and convicts, determining everything from bail amounts to sentences. When ProPublica compared COMPAS’s risk assessments for more than 10,000 people arrested in one Florida county with how often those people actually went on to reoffend, it discovered that the algorithm “correctly predicted recidivism for black and white defendants at roughly the same rate.”…

After ProPublica’s investigation, Northpointe, the company that developed COMPAS, disputed the story, arguing that the journalists misinterpreted the data. So did three criminal-justice researchers, including one from a justice-reform organization. Who’s right—the reporters or the researchers? Krishna Gummadi, head of the Networked Systems Research Group at the Max Planck Institute for Software Systems in Saarbrücken, Germany, offers a surprising answer: they all are.

Gummadi, who has extensively researched fairness in algorithms, says ProPublica’s and Northpointe’s results don’t contradict each other. They differ because they use different measures of fairness.

Imagine you are designing a system to predict which criminals will reoffend. One option is to optimize for “true positives,” meaning that you will identify as many people as possible who are at high risk of committing another crime. One problem with this approach is that it tends to increase the number of false positives: people who will be unjustly classified as likely reoffenders. The dial can be adjusted to deliver as few false positives as possible, but that tends to create more false negatives: likely reoffenders who slip through and get a more lenient treatment than warranted.
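[Illustration, not part of the quoted article.] The trade-off described above can be sketched with a toy risk score and an adjustable cutoff. Everything here is an assumption made for illustration: the scores, the outcomes, the two cutoffs, and the helper name `confusion_at` are invented and have nothing to do with COMPAS data or its actual scoring method.

```python
from typing import NamedTuple


class Confusion(NamedTuple):
    tp: int  # flagged as high risk, did reoffend
    fp: int  # flagged as high risk, did not reoffend
    fn: int  # not flagged, did reoffend
    tn: int  # not flagged, did not reoffend


def confusion_at(scores, reoffended, threshold):
    """Count outcomes when everyone scoring >= threshold is flagged."""
    tp = fp = fn = tn = 0
    for score, outcome in zip(scores, reoffended):
        flagged = score >= threshold
        if flagged and outcome:
            tp += 1
        elif flagged and not outcome:
            fp += 1
        elif not flagged and outcome:
            fn += 1
        else:
            tn += 1
    return Confusion(tp, fp, fn, tn)


# Invented toy data: ten people, a risk score each, and whether they reoffended.
scores = [1, 2, 3, 3, 4, 5, 6, 7, 8, 9]
reoffended = [False, False, False, True, False, True, False, True, True, True]

low_cutoff = confusion_at(scores, reoffended, threshold=3)   # flags many people
high_cutoff = confusion_at(scores, reoffended, threshold=7)  # flags few people

print("low cutoff: ", low_cutoff)
print("high cutoff:", high_cutoff)
```

With these made-up numbers, the low cutoff catches every reoffender but wrongly flags three people, while the high cutoff flags no one wrongly but misses two reoffenders. Moving the dial trades one kind of error for the other.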

Raising the incidence of true positives or lowering the false positives are both ways to improve a statistical measure known as positive predictive value, or PPV. That is the percentage of all positives that are true….
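[Illustration, not part of the quoted article.] In confusion-matrix terms, PPV is TP / (TP + FP): of the people flagged as high risk, the share who actually reoffend. The sketch below shows how two groups can have the same PPV while their false positive rates, FP / (FP + TN), differ, which is one way two analyses using different fairness measures can both be correct. The counts are invented, the group labels are hypothetical, and the choice of false positive rate as the contrasting measure is an assumption for this example.

```python
from typing import NamedTuple


class Counts(NamedTuple):
    tp: int  # flagged, reoffended
    fp: int  # flagged, did not reoffend
    fn: int  # not flagged, reoffended
    tn: int  # not flagged, did not reoffend


def ppv(c: Counts) -> float:
    """Positive predictive value: share of flagged people who reoffended."""
    return c.tp / (c.tp + c.fp)


def false_positive_rate(c: Counts) -> float:
    """Share of people who did not reoffend but were flagged anyway."""
    return c.fp / (c.fp + c.tn)


# Invented counts for two hypothetical groups with different base rates.
group_a = Counts(tp=60, fp=40, fn=20, tn=80)
group_b = Counts(tp=30, fp=20, fn=30, tn=120)

for name, counts in (("A", group_a), ("B", group_b)):
    print(
        f"group {name}: PPV={ppv(counts):.2f}, "
        f"FPR={false_positive_rate(counts):.2f}"
    )
```

In this toy setup, a flag means the same thing in both groups (60 percent of flagged people reoffend), yet a person in group A who does not reoffend is more than twice as likely to be flagged as a comparable person in group B.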

But if we accept that algorithms might make life fairer if they are well designed, how can we know whether they are so designed?

Democratic societies should be working now to determine how much transparency they expect from ADM systems. Do we need new regulations of the software to ensure it can be properly inspected? Lawmakers, judges, and the public should have a say in which measures of fairness get prioritized by algorithms. But if the algorithms don’t actually reflect these value judgments, who will be held accountable?

These are the hard questions we need to answer if we expect to benefit from advances in algorithmic technology…”