Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize.
In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified.
We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circuits with 163 component-level annotations.
We propose HyVE (Hypothesize, Validate, Explain), an agentic explainer that analyzes each component through an iterative loop of observation, hypothesis generation, and causal validation, eventually producing a component-level explanation and a circuit-level task description.
Across four LM backbones, HyVE recovers useful component- and task-level explanations, but no backbone is uniformly best.
Our analysis shows that strong backbones usually form observation-grounded hypotheses, while failures more often arise later in the validation loop, through incomplete validation plans, code execution errors, or unresolved hypotheses.
A case study on an arithmetic circuit in Llama-3-8B shows that the same formulation can extend beyond semi-synthetic benchmarks to naturally trained models.
Overall, LM agents are promising circuit explainers, but reliable validation remains the key obstacle.