Can Language Model Agents be Helpful Circuit Explainers in Mechanistic Interpretability?

arXiv CS.AI

CC BY

This outlet displays the full text directly under a public or free license.

Abstract

Mechanistic interpretability has made substantial progress in automatically localizing circuits, but explaining what localized components do remains labor-intensive and difficult to standardize.

In this work, we study whether language model (LM) agents can assist with this explanation problem once a circuit has already been identified.

We introduce AgenticInterpBench, a benchmark for circuit explanation built from 84 semi-synthetic transformer circuits with 163 component-level annotations.

We propose HyVE (Hypothesize, Validate, Explain), an agentic explainer that analyzes each component through an iterative loop of observation, hypothesis generation, and causal validation, eventually producing a component-level explanation and a circuit-level task description.

Across four LM backbones, HyVE recovers useful component- and task-level explanations, but no backbone is uniformly best.

Our analysis shows that strong backbones usually form observation-grounded hypotheses, while failures more often arise later in the validation loop, through incomplete validation plans, code execution errors, or unresolved hypotheses.

A case study on an arithmetic circuit in Llama-3-8B shows that the same formulation can extend beyond semi-synthetic benchmarks to naturally trained models.

Overall, LM agents are promising circuit explainers, but reliable validation remains the key obstacle.
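
The observe, hypothesize, validate loop that the abstract attributes to HyVE can be illustrated with a toy sketch. This is not the authors' implementation. Every name here (`Hypothesis`, `run_circuit`, `observe`, `propose`, `validate`, `explain_component`) is hypothetical, plain Python functions stand in for transformer components, a fixed candidate list stands in for LM-generated hypotheses, and a simple substitution test stands in for causal validation.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple


@dataclass
class Hypothesis:
    """A candidate explanation: a description plus the function it claims."""
    description: str
    predict: Callable[[int], int]


def run_circuit(component: Callable[[int], int], x: int) -> int:
    """Toy circuit (assumption): one component feeding a fixed downstream map."""
    return component(x) * 10 + 1


def observe(component: Callable[[int], int],
            probes: Sequence[int]) -> List[Tuple[int, int]]:
    """Observation: record the component's output on each probe input."""
    return [(x, component(x)) for x in probes]


def propose(observations: Sequence[Tuple[int, int]],
            candidates: Sequence[Hypothesis]) -> List[Hypothesis]:
    """Hypothesis generation: keep candidates that fit every observation."""
    return [h for h in candidates
            if all(h.predict(x) == y for x, y in observations)]


def validate(component: Callable[[int], int],
             hypothesis: Hypothesis,
             held_out: Sequence[int]) -> bool:
    """Validation stand-in: substitute the hypothesized function for the
    component and check that the circuit output is unchanged."""
    return all(run_circuit(hypothesis.predict, x) == run_circuit(component, x)
               for x in held_out)


def explain_component(component: Callable[[int], int],
                      candidates: Sequence[Hypothesis],
                      probe_rounds: Sequence[Sequence[int]],
                      held_out: Sequence[int]) -> Optional[str]:
    """Iterate observe -> hypothesize -> validate; None means unresolved."""
    observations: List[Tuple[int, int]] = []
    rejected = set()
    for probes in probe_rounds:
        observations.extend(observe(component, probes))
        for h in propose(observations, candidates):
            if h.description in rejected:
                continue  # already failed validation in an earlier round
            if validate(component, h, held_out):
                return h.description
            rejected.add(h.description)
    return None


# Toy usage: the "component" secretly computes x mod 3.
toy_component = lambda x: x % 3
candidates = [
    Hypothesis("computes parity of x", lambda x: x % 2),
    Hypothesis("copies x unchanged", lambda x: x),
    Hypothesis("computes x mod 3", lambda x: x % 3),
]
explanation = explain_component(
    toy_component, candidates, probe_rounds=[[0, 1], [2, 3, 4]], held_out=[5, 6, 7]
)
print(explanation)
```

In this sketch all three candidates fit the first round of observations, so only the validation step separates them. If the candidate list does not contain a correct hypothesis, the loop returns `None`, a toy analogue of the unresolved-hypothesis outcome the abstract mentions.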
