White Paper by the Aspen Institute: “…When people develop machine learning models for AI products and services, they iterate to improve performance.
What it means to “improve” a machine learning model depends on what you want the model to do, such as correctly transcribing an audio sample or generating a reliable summary of a long document.
Machine learning benchmarks function like standardized tests against which AI researchers and builders can score their work. Benchmarks allow us both to see whether different model tweaks improve performance on the intended task and to compare similar models against one another.
Some famous benchmarks in AI include ImageNet and the Stanford Question Answering Dataset (SQuAD).
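The mechanics described above can be sketched in a few lines: a benchmark is a frozen set of labeled examples plus a scoring rule, and any model scored on the same set can be compared. The "models" and examples below are hypothetical toy stand-ins, not part of the white paper.

```python
# Minimal sketch of a benchmark: a fixed set of labeled examples,
# a scoring function, and models compared on the same score.
# All names and data here are illustrative placeholders.
from typing import Callable, List, Tuple

# A benchmark is a frozen list of (input, expected_answer) pairs.
BENCHMARK: List[Tuple[str, str]] = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("3 * 3", "9"),
]

def score(model: Callable[[str], str]) -> float:
    """Fraction of benchmark items the model answers correctly (accuracy)."""
    correct = sum(1 for prompt, answer in BENCHMARK if model(prompt) == answer)
    return correct / len(BENCHMARK)

# Two toy "models": the tweaked version answers one more item correctly.
def baseline(prompt: str) -> str:
    return {"2 + 2": "4"}.get(prompt, "unknown")

def tweaked(prompt: str) -> str:
    return {"2 + 2": "4", "capital of France": "Paris"}.get(prompt, "unknown")

print(score(baseline))  # 1/3 of items correct
print(score(tweaked))   # 2/3 — the shared yardstick shows the tweak helped
```

Because both models are scored against the same fixed test set, the difference in scores is attributable to the model change rather than to a change in the evaluation, which is the point of a shared benchmark.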
Benchmarks are important, but their development and adoption have historically been somewhat arbitrary. The capabilities that benchmarks measure should reflect the public’s priorities for what AI tools should be and do.
We can build positive AI futures, ones that emphasize what the public wants out of these emerging technologies. As such, it’s imperative that we build benchmarks worth striving for…(More)”.