Life After Benchmark Saturation: A Case Study of CORE-Bench

arXiv cs.AI

CC BY

Abstract

When a benchmark's accuracy saturates, it is often retired and replaced with a more challenging version.

We show that this approach privileges accuracy and misses the opportunity to study six other key dimensions of agent performance: construct validity issues such as shortcuts, out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold, and uplift from human-agent collaboration.

We use CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, as a case study to demonstrate that measuring agents along these dimensions yields meaningful insights into agent performance even after accuracy saturates.

First, we surface threats to construct validity in CORE-Bench Hard that are difficult to anticipate with less capable agents.

We introduce an improved benchmark, CORE-Bench v1.1, and an out-of-distribution task suite, CORE-Bench OOD.

Second, we find that despite accuracy saturation, CORE-Bench v1.1 remains useful for measuring efficiency, reliability, model performance, and scaffold performance.

Finally, we conduct a small-scale randomized experiment to measure uplift from human-agent collaboration on real-world computational reproducibility tasks.

We find a statistically significant speedup of about a factor of two (likely an underestimate, because one-fifth of human-only reproductions reached the time limit before completing) and describe various other findings.

Together, our contributions present a more rigorous alternative to the dominant accuracy-centric evaluation paradigm.
