Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding
Abstract
Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination.
Departing from the prevailing attention-intensity assumption, we reveal a deeper, dynamic structural misalignment: hallucination is triggered at decision-critical steps where specific attention heads, acting as risky mediators, decouple from visual evidence and lock onto language priors.
This establishes a pathological shortcut that bypasses visual grounding.
To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework.
Fox diagnoses structural misalignment with a visual attention entropy probe that localizes risky mediators without supervision.
We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path.
Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency.
Extensive experiments demonstrate that Fox achieves state-of-the-art performance, outperforming SID by 29.1% while preserving linguistic richness.
Code is available at this https URL.
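
To make the probe-then-intervene idea concrete, the following is a minimal NumPy sketch and not the released Fox implementation. It computes a per-head Shannon entropy over the visual-token slice of one decoding step's attention, flags a few heads, and saturates their pre-softmax logits on text keys. Everything beyond the terms "visual attention entropy" and "logit saturation" is an assumption made for illustration: the function names, the rule that the highest-entropy heads are the risky ones, the choice to saturate text-key logits, and the constant -1e4.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def visual_attention_entropy(attn, visual_idx, eps=1e-12):
    """Shannon entropy of each head's attention restricted to visual tokens.

    attn: array of shape (heads, keys), attention weights for one query step.
    visual_idx: indices of the keys that correspond to image tokens.
    """
    vis = attn[:, visual_idx]
    # Renormalize so the visual slice is a probability distribution per head.
    p = vis / (vis.sum(axis=-1, keepdims=True) + eps)
    return -(p * np.log(p + eps)).sum(axis=-1)


def flag_heads(entropy, k):
    """Hypothetical selection rule: treat the k highest-entropy heads as risky."""
    return np.argsort(entropy)[-k:]


def saturate_logits(logits, heads, text_idx, value=-1e4):
    """Return a copy of the logits with text keys saturated for the given heads.

    Assumption: driving these logits to a large negative value removes
    (after softmax) the attention those heads place on text tokens.
    """
    out = logits.copy()
    out[np.ix_(heads, text_idx)] = value
    return out
```

In a real model the same operations would be applied inside the attention layers at the decision-critical decoding steps. How those steps are detected, and how the conflict-gated cooperative decoding combines the intervened and original outputs, is not specified in the abstract and is not sketched here.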