LiveClawBench: Benchmarking LLM Agents on Complex, Real-World Assistant Tasks

arXiv CS.AI


CC BY

This outlet displays the full text directly under a public, free license.

Abstract

OpenClaw-style personal assistants extend LLM agents from isolated tool use to open-ended, stateful, and personalized software environments.

Evaluating these assistants is fundamentally a fidelity problem: benchmarks must be faithful both to the distribution of real assistant tasks and to the execution semantics of the environments in which those tasks unfold.

Existing benchmarks often lose fidelity in one dimension or the other.

Their task distributions are shaped by what is easy to isolate, mock, and verify, underrepresenting real-world difficulties such as cross-service dependency, contaminated state, implicit intent, and runtime change.

Their environments are either live but hard to reproduce, or reproducible but reduced to endpoint-level stubs that remove sessions, artifacts, state transitions, and downstream side effects.
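The contrast between an endpoint-level stub and a stateful mock can be made concrete with a minimal sketch. Everything below is hypothetical and chosen for illustration only: the names `stub_send_email`, `MockMailService`, and `verify_sent` are assumptions and are not taken from LiveClawBench itself. The stub returns a canned response and records nothing, while the mock keeps sessions and an outbox, so a verifier can check downstream side effects in the final state.

```python
from dataclasses import dataclass, field


def stub_send_email(to: str, body: str) -> dict:
    """Endpoint-level stub (hypothetical): canned reply, no state recorded."""
    return {"status": "sent"}


@dataclass
class MockMailService:
    """Stateful mock (hypothetical): calls mutate state that later calls see."""
    sessions: set = field(default_factory=set)
    outbox: list = field(default_factory=list)

    def login(self, user: str) -> str:
        # Creating a session is a state transition the stub cannot express.
        token = f"session-{user}"
        self.sessions.add(token)
        return token

    def send_email(self, token: str, to: str, body: str) -> dict:
        # Requests without a live session fail, as they would in a real app.
        if token not in self.sessions:
            return {"status": "error", "reason": "not logged in"}
        self.outbox.append({"to": to, "body": body})
        return {"status": "sent", "id": len(self.outbox)}


def verify_sent(service: MockMailService, to: str) -> bool:
    """Check the environment's final state rather than the agent's reply."""
    return any(msg["to"] == to for msg in service.outbox)
```

With the stub, an agent that skips login still appears to succeed; with the mock, the same shortcut leaves the outbox empty and the state-based check fails.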

We introduce LiveClawBench, a benchmark designed around this dual-fidelity requirement.

LiveClawBench combines a Triple-Axis Complexity Framework for difficulty-driven task construction with reproducible full-stack mock applications that preserve stateful execution semantics.

With 134 executable cases spanning 10 domains and 22 mocked services, LiveClawBench supports controlled, extensible, and factor-level diagnostic evaluation of realistic agentic tasks.

We release the benchmark resources: (1) the benchmark, (2) a leaderboard, and (3) trajectories.

