학술
기타
From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
arXiv CS.AI
조회 0
CC BY
이 매체는 공공·자유 라이선스로 본문을 직접 표시합니다.Computer Science > Artificial Intelligence
[Submitted on 8 Jun 2026]
Title:From Senses to Decisions: The Information Flow of Auditory and Visual Perception in Multimodal LLMs
View PDF HTML (experimental)Abstract:Multimodal Large Language Models (MLLMs) can listen and see, but how do audio and visual signals actually travel through the network to shape an answer? Despite their growing role in research and real-world applications, the internal pathways through which audio and visual tokens influence the final prediction remain poorly understood. In this study, we examine audio-visual information flow inside Audio-Visual Large Language Models (AVLLMs), tracing how AVLLMs route, utilize, and integrate audio and visual information across two input configurations, audio-visual video and multiple interleaved audio-visual items. We find that for audio-visual video, AVLLMs follow the sequential information flow pathway established for VLMs and VideoLLMs, with audio and visual contribution flowing along this pathway in proportion to the task's reliance on each modality. In settings with multiple interleaved audio-visual items, this routing shifts to different parallel streams. Furthermore, we demonstrate that audio-visual and other token types can be discarded once their information is transferred to LLM, with minimal impact on the model's prediction or even slight improvement, generalizing across multiple tasks and datasets, enabling more efficient inference. These findings hold across multiple models and scales, Qwen2.5-Omni and Video-SALMONN2 Plus at 3B and 7B scales, leading to hypotheses on why these flow structures emerge. Together, these results deliver the first coherent picture of how AVLLMs orchestrate sound and sight inside the network and lay the groundwork for the next wave of interpretability, design, and efficiency advances in audio-visual and broader MLLMs.
Submission history
From: Wish Suharitdamrong [view email][v1] Mon, 8 Jun 2026 20:26:09 UTC (3,121 KB)
Current browse context:
cs.AI
References & Citations
Loading...
Bibliographic and Citation Tools
Bibliographic Explorer (What is the Explorer?)
Connected Papers (What is Connected Papers?)
Litmaps (What is Litmaps?)
scite Smart Citations (What are Smart Citations?)
Code, Data and Media Associated with this Article
alphaXiv (What is alphaXiv?)
CatalyzeX Code Finder for Papers (What is CatalyzeX?)
DagsHub (What is DagsHub?)
Gotit.pub (What is GotitPub?)
Hugging Face (What is Huggingface?)
ScienceCast (What is ScienceCast?)
Demos
Recommenders and Search Tools
Influence Flower (What are Influence Flowers?)
CORE Recommender (What is CORE?)
arXivLabs: experimental projects with community collaborators
arXivLabs is a framework that allows collaborators to develop and share new arXiv features directly on our website.
Both individuals and organizations that work with arXivLabs have embraced and accepted our values of openness, community, excellence, and user data privacy. arXiv is committed to these values and only works with partners that adhere to them.
Have an idea for a project that will add value for arXiv's community? Learn more about arXivLabs.
관련 뉴스
관련 뉴스 제보는 로그인 후 가능합니다.
'research' 카테고리 뉴스
Correction: Family support after a family member’s suicide: A qualitative exploration
PLOS ONE
Correction: ESG rating disagreement and bank loan availability: Evidence from China
PLOS ONE
Microbiome analysis of bronchoalveolar lavage (BAL) specimens from immunocompromised patients with pneumonia compared to those from healthy volunteers
PLOS ONE