LLM-as-a-judge validity in physics assessment depends more on the task than the model
Abstract
As large language models (LLMs) are increasingly considered for automated assessment and feedback, understanding when LLM marking is valid is essential.
We evaluate LLM-as-a-judge marking across three physics assessment formats (structured questions, written essays, and scientific plots), comparing GPT-5.2, Grok 4.1, Claude Opus 4.5, DeepSeek-V3.2, Gemini Pro 3, and committee aggregations against human markers under blind, solution-provided, false-solution, and anchored conditions.
We distinguish absolute accuracy from rank-order agreement, since a marking system can match the distribution of human marks while failing to order responses by quality.
Across task types, performance is sharply task-dependent.
For blind university exam questions ($n=771$) and secondary and university structured questions ($n=1151$), models show robust rank-order agreement with human markers (Spearman $\rho > 0.6$), with official solutions reducing error and strengthening agreement.
False solutions degrade absolute accuracy, showing that models defer to provided references, but leave rank-ordering intact.
Essay marking behaves fundamentally differently.
Across $n=55$ scripts ($n=275$ essays), blind AI marking is harsher and more variable than human marking, and adding a mark scheme does not improve rank-order agreement.
Anchored exemplars shift the AI mean close to the human mean and compress variance below the human standard deviation, but rank-order agreement remains near-zero.
For code-based plot elements ($n=1400$), models achieve high rank-order agreement ($\rho > 0.84$) with near-linear calibration.
Across all task types, validity tracks the structure of the assessment task (the extent to which marks can be mapped to explicit, observable grading features) and the reliability of the human benchmark, rather than raw model capability.
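
The distinction drawn above between absolute accuracy and rank-order agreement can be illustrated with a minimal sketch. The marks below are invented for illustration and are not taken from the study; the function names (`average_ranks`, `spearman_rho`, `mean_absolute_error`) are hypothetical helpers, and mean absolute error is assumed here as one possible measure of absolute accuracy. One hypothetical marker reproduces the human mark distribution exactly but assigns the marks to the wrong responses; the other is uniformly too generous but orders the responses perfectly.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def mean_absolute_error(x, y):
    """Average absolute difference between two sets of marks."""
    return mean(abs(a - b) for a, b in zip(x, y))


# Invented marks for five responses (illustrative only, not study data).
human = [2, 4, 5, 7, 9]
# Same multiset of marks as the human marker, assigned to the wrong responses.
same_distribution = [7, 2, 9, 4, 5]
# Two marks too generous on every response, but identical ordering.
constant_offset = [h + 2 for h in human]

for name, marks in [("same distribution", same_distribution),
                    ("constant offset", constant_offset)]:
    print(name,
          "mean:", mean(marks),
          "MAE:", mean_absolute_error(human, marks),
          "rho:", round(spearman_rho(human, marks), 3))
```

In this toy case the first marker matches the human mean and spread yet has a rank correlation near zero, while the second has a systematic error of two marks yet perfect rank-order agreement, which is why the two properties need to be reported separately.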