Stateful Token Reduction for Long-Video Hybrid VLMs
Abstract
Token reduction accelerates long-video vision--language models (VLMs), but existing methods target Transformers, where reduction is treated as token pruning.
We study token reduction in hybrid Mamba--Transformer VLMs and find that it is \emph{stateful}: Mamba layers maintain a recurrent state that accumulates information from earlier tokens, allowing information from discarded tokens to persist, so reduction behaves more like compression than pruning. We support this view with a representation-based probing method that measures how much information from discarded tokens is retained, and we analyze layer-wise sparsity and cross-layer importance stability.
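The stateful behavior can be illustrated with a toy example. The sketch below is not the paper's model or probing method: `linear_scan` is a hypothetical scalar-decay recurrence standing in for a Mamba layer, and the decay value, sequence length, and kept indices are arbitrary assumptions. It compares dropping tokens after the recurrence has run with dropping them before.

```python
import numpy as np

def linear_scan(x, decay):
    """Toy recurrence h_t = decay * h_{t-1} + x_t (an assumed stand-in, not a real Mamba layer)."""
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

rng = np.random.default_rng(0)
x = rng.normal(size=(8, 4))      # 8 tokens with 4 features each
keep = np.array([0, 3, 7])       # arbitrary set of retained token positions

# Reduce after the layer: retained outputs were computed from the full sequence.
reduce_after = linear_scan(x, 0.9)[keep]
# Reduce before the layer: discarded tokens never enter the state.
reduce_before = linear_scan(x[keep], 0.9)

# Whatever differs is the contribution of discarded tokens carried by the state.
carried = reduce_after - reduce_before
```

In this toy setting, `carried` is zero for the first retained token, which has no predecessors, and nonzero for the later ones, which is the sense in which reduction after a recurrent layer resembles compression.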
Our findings show that importance is sparse within layers but unstable across layers, making aggressive early pruning unreliable, while hybrids remain robust to later reduction. Motivated by this, we propose a hybrid-aware token reduction framework with a low-to-high progressive schedule and a unified query-conditioned importance score for attention and Mamba layers.
For Mamba, excluding the position-dependent decay from the recurrence produces a stronger selection signal.
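A minimal sketch of how a low-to-high schedule and a query-conditioned score might fit together is given below. Everything in it is an assumption made for illustration: the linear schedule in `keep_counts`, the dot-product score in `query_scores`, and the helper names are hypothetical and are not the paper's formulation. In particular, the Mamba-specific score without decay is not reproduced here.

```python
import numpy as np

def keep_counts(n_tokens, n_stages, final_budget=0.25):
    """Assumed schedule: the kept fraction falls linearly from 1.0 to final_budget,
    so each later stage removes a larger share of the remaining tokens."""
    fractions = np.linspace(1.0, final_budget, n_stages + 1)[1:]
    return [int(round(n_tokens * f)) for f in fractions]

def query_scores(tokens, query):
    """Assumed score: scaled dot product between each visual token and the mean query vector."""
    return tokens @ query.mean(axis=0) / np.sqrt(tokens.shape[1])

def progressive_reduce(tokens, query, counts):
    """Keep the top-k tokens at each stage, preserving their temporal order."""
    kept = np.arange(tokens.shape[0])
    for k in counts:
        # In a real model, hidden states would be updated by layers between stages,
        # so scores would change; here the same features are reused for simplicity.
        scores = query_scores(tokens[kept], query)
        top = np.sort(np.argsort(-scores)[:k])
        kept = kept[top]
    return kept

rng = np.random.default_rng(0)
tokens = rng.normal(size=(64, 16))   # 64 visual tokens
query = rng.normal(size=(5, 16))     # 5 text-query tokens
counts = keep_counts(64, n_stages=3)
kept = progressive_reduce(tokens, query, counts)
```

With 64 tokens and three stages, this schedule keeps 48, 32, and 16 tokens, so the per-stage drop rate rises from 25% to 50% while ending at the 25% budget.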
Across long-video benchmarks, our method achieves $3.8{\times}$--$4.2{\times}$ prefilling speedups at a 25% token budget while maintaining near-baseline accuracy and improving with light finetuning.
Hybrid models benefit from aggressive reduction, improving both efficiency and accuracy, whereas Transformers exhibit the standard trade-off.
Our method also outperforms prior baselines on the same hybrid backbone and combines effectively with visual redundancy reduction methods.