BenGER: Benchmarking LLM Systems on Subsumption-Based Legal Reasoning in German Law
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
We introduce BenGER (Benchmark for German Law), a benchmark and dataset for evaluating LLM systems on subsumption-based legal reasoning in German law.
The dataset combines 596 exam-style free-text legal case tasks across multiple levels of legal education and 531 short doctrinal reasoning tasks.
It includes a controlled validation subset of timed human-written solutions under both unaided and human-AI co-creation conditions.
We evaluate 12 contemporary LLM systems - closed flagship, efficiency-oriented, and open-weight - with a rubric-aligned LLM-as-a-Judge cross-validated against a multi-rater human-grading layer (three blind reviews per solution, six judge families benchmarked against the human pool).
Closed-flagship systems lead the leaderboard across all three corpora, human-AI co-creation measurably improves on unaided human work, and the LLM judge tracks human grading at Pearson r=0.76 and Cohen's \k{appa}=0.60.
System rankings are stable across judge families and two judges from independent providers clear the Calderon single-reviewer replacement bar on human-authored solutions.