SWE-Bench, a test used to evaluate the coding ability of AI models, has become a key measure of AI progress, but developers are now “gaming” the test.
The benchmark has drifted away from evaluating the actual capabilities of AI, calling its basic value into question, according to Stanford University’s Vanessa Parli.
Some AI researchers think the answer is to go smaller, drawing inspiration from the social sciences: asking developers to show that their models do what they claim, and asking whether what they are measuring has a coherent definition.
A new research group hosted by Hugging Face, the University of Edinburgh, and EleutherAI is developing ideas on validity in AI model evaluation, with a focus on building valid benchmarks that go beyond measuring straightforward capabilities.
This would involve reconnecting benchmarks to specific tasks and requiring developers to spell out exactly what capability a benchmark is testing and how that capability relates to the tasks that make it up.