LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks
Abstract
OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments.
Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be faithful both to the distribution of real assistant tasks and to the execution semantics of the environments in which those tasks unfold.
Existing benchmarks often lose fidelity in one dimension or the other.
Their task distributions are shaped by what is easy to isolate, mock, and verify, and so underrepresent real-world difficulties such as cross-service dependencies, contaminated state, implicit intent, and runtime change.
Their environments are either live but hard to reproduce, or reproducible but reduced to endpoint-level stubs that remove sessions, artifacts, state transitions, and downstream side effects.
We introduce LiveClawBench, a benchmark designed around this dual-fidelity requirement.
LiveClawBench combines a Triple-Axis Complexity Framework for difficulty-driven task construction with reproducible full-stack mock applications that preserve stateful execution semantics.
With 134 executable cases spanning 10 domains and 22 mocked services, LiveClawBench supports controlled, extensible, and factor-level diagnostic evaluation of realistic agentic tasks.
We release the benchmark resources: (1) Benchmark: this https URL (2) Leaderboard: this https URL (3) Trajectories: this https URL
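
To make the distinction between endpoint-level stubs and mocks that preserve stateful execution semantics concrete, the following is a minimal illustrative sketch. It is not taken from LiveClawBench: the names `stub_send_email` and `MockMailService`, and the mail scenario itself, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


def stub_send_email(sender: str, recipient: str, body: str) -> dict:
    """Endpoint-level stub (hypothetical): always succeeds and records nothing."""
    return {"status": "ok"}


@dataclass
class MockMailService:
    """Hypothetical stateful mock: keeps sessions and produces side effects."""
    sessions: set = field(default_factory=set)    # users currently logged in
    inboxes: dict = field(default_factory=dict)   # recipient -> list of messages

    def login(self, user: str) -> None:
        # State transition: later calls by this user depend on it.
        self.sessions.add(user)

    def send_email(self, sender: str, recipient: str, body: str) -> dict:
        # The call fails unless an earlier step established a session.
        if sender not in self.sessions:
            return {"status": "error", "reason": "not logged in"}
        # Downstream side effect: the message becomes visible to the recipient.
        self.inboxes.setdefault(recipient, []).append(
            {"from": sender, "body": body}
        )
        return {"status": "ok"}

    def inbox(self, user: str) -> list:
        # Return a copy so callers cannot mutate the stored state.
        return list(self.inboxes.get(user, []))
```

With the stub, an agent that skips the login step or sends to the wrong recipient still receives `{"status": "ok"}`, so the mistake cannot be detected. With the stateful mock, a verifier can inspect the resulting inbox and session state, which is the kind of check that an endpoint-level stub removes.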