Dismantling Pathological Shortcuts: A Causal Framework for Faithful LVLM Decoding

arXiv CS.AI

CC BY

This outlet displays the full text directly under a public/free license.

Abstract

Large Vision-Language Models (LVLMs) exhibit sophisticated reasoning but remain susceptible to object hallucination.

Departing from the prevailing attention-intensity assumption, we identify a deeper, dynamic structural misalignment: hallucination is triggered at decision-critical steps, where specific attention heads act as risky mediators that decouple from visual evidence and lock onto language priors.

This establishes a pathological shortcut that bypasses visual grounding.

To dismantle this, we propose Fox (Faithfulness and Observational-flow via eXpression-rectification), a training-free inference-time framework.

Fox diagnoses structural misalignment with a visual attention entropy probe that localizes risky mediators without supervision.
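
To make the idea of an entropy probe concrete, here is a minimal sketch, not the authors' implementation. It assumes each head's attention for the current decoding step is available as a list of weights, that the positions of the visual tokens are known, and that heads with high normalized entropy over visual tokens are the ones to flag. The abstract does not state the flagging direction or any threshold; the function names and the value 0.9 are hypothetical.

```python
import math

def visual_attention_entropy(attn_row, visual_idx):
    """Normalized Shannon entropy of one head's attention over visual tokens."""
    mass = [attn_row[i] for i in visual_idx]
    total = sum(mass)
    if total <= 0.0 or len(mass) < 2:
        return 0.0
    probs = [m / total for m in mass]  # renormalize over visual tokens only
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(mass))  # scale to roughly [0, 1]

def flag_heads(head_rows, visual_idx, threshold=0.9):
    """Indices of heads whose entropy exceeds the threshold.

    Flagging HIGH entropy is an illustrative assumption, as is the threshold.
    """
    return [h for h, row in enumerate(head_rows)
            if visual_attention_entropy(row, visual_idx) > threshold]

# Toy input: positions 0-2 are visual tokens, position 3 is a text token.
heads = [
    [0.02, 0.90, 0.03, 0.05],  # head 0: peaked on one visual token
    [0.10, 0.10, 0.10, 0.70],  # head 1: diffuse over visual tokens
]
flagged = flag_heads(heads, visual_idx=[0, 1, 2])
print(flagged)
```

Because the score depends only on the attention weights, no labels are needed, which matches the abstract's point that localization works without supervision.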

We then execute a targeted causal intervention via numerical logit saturation to physically sever the shortcut path.
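
One way to picture logit saturation is shown below, as a sketch under stated assumptions. In a flagged head, the pre-softmax attention logits at non-visual positions are replaced with a large negative constant, so the softmax assigns them effectively zero weight. Which positions are saturated, the constant `SATURATION`, and the helper names are all hypothetical; the abstract only says the intervention saturates logits to sever the shortcut path.

```python
import math

SATURATION = -1e4  # hypothetical large negative logit; finite to avoid NaNs

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def saturate_head(logits, visual_idx):
    """Keep logits at visual positions; saturate every other position."""
    keep = set(visual_idx)
    return [x if i in keep else SATURATION for i, x in enumerate(logits)]

# Toy head: positions 0-1 are visual tokens, positions 2-3 are text tokens.
logits = [0.5, 1.0, 4.0, 3.5]
before = softmax(logits)
after = softmax(saturate_head(logits, visual_idx=[0, 1]))
print([round(p, 3) for p in before])
print([round(p, 3) for p in after])
```

Using a finite constant rather than negative infinity keeps the softmax well defined even if every position in a row were saturated.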

Finally, a conflict-gated cooperative decoding strategy reconciles interventional faithfulness with observational fluency.
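
The gating idea can be sketched as follows, again as an illustration and not the paper's procedure. It assumes two next-token distributions are available: an observational one from the unmodified model and an interventional one from the model with the shortcut severed. Conflict is measured here with Jensen-Shannon divergence, and the intervened distribution is used only when the conflict exceeds a threshold. The divergence choice, the threshold `tau`, and the function names are hypothetical.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0.0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def gated_next_token(p_obs, p_int, tau=0.1):
    """Pick the next token id, trusting the intervention only under conflict."""
    source = p_int if js_divergence(p_obs, p_int) > tau else p_obs
    return max(range(len(source)), key=source.__getitem__)

p_obs = [0.7, 0.2, 0.1]           # observational (unmodified) distribution
p_int_conflict = [0.1, 0.8, 0.1]  # intervention disagrees strongly
p_int_agree = [0.65, 0.25, 0.1]   # intervention roughly agrees

token_conflict = gated_next_token(p_obs, p_int_conflict)
token_agree = gated_next_token(p_obs, p_int_agree)
print(token_conflict, token_agree)
```

Gating in this way leaves most steps untouched, which is one plausible reading of how fluency is preserved while the intervention handles the decision-critical steps.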

Extensive experiments demonstrate that Fox achieves state-of-the-art performance, outperforming SID by 29.1% while preserving linguistic richness.

Code is available at this https URL.
