Semantic Generative Tuning for Unified Multimodal Models
Abstract
Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture.
However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives.
Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement.
This work presents the first systematic investigation into generative post-training, in which we formulate hierarchical visual tasks as generative proxies to bridge the gap between understanding and generation in UMMs.
Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies.
Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity.
Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities.
Mechanistic analyses further demonstrate that SGT improves the linear separability of features and optimizes visual-textual attention allocation patterns.
Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks.
Our code is available at this https URL.
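The feature linear separability mentioned in the abstract is commonly measured with a linear probe: a linear classifier is trained on frozen features, and its held-out accuracy serves as the separability score. The sketch below illustrates that general idea only; it is not the analysis protocol of this work, which the abstract does not specify. The synthetic Gaussian features and all function names (`train_linear_probe`, `probe_accuracy`, `make_synthetic_features`, `evaluate`) are hypothetical stand-ins chosen for illustration.

```python
import math
import random


def train_linear_probe(features, labels, epochs=50, lr=0.1):
    """Fit binary logistic regression (a linear probe) with plain SGD."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            logit = sum(w * v for w, v in zip(weights, x)) + bias
            logit = max(-30.0, min(30.0, logit))  # clamp to keep exp() stable
            prob = 1.0 / (1.0 + math.exp(-logit))
            error = prob - y  # gradient of the log loss w.r.t. the logit
            weights = [w - lr * error * v for w, v in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def probe_accuracy(weights, bias, features, labels):
    """Fraction of samples on the correct side of the learned hyperplane."""
    correct = 0
    for x, y in zip(features, labels):
        logit = sum(w * v for w, v in zip(weights, x)) + bias
        correct += int((logit > 0) == (y == 1))
    return correct / len(labels)


def make_synthetic_features(rng, n, dim, separation):
    """Hypothetical stand-in for frozen model features: two Gaussian clusters."""
    features, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        center = separation if y == 1 else -separation
        features.append([rng.gauss(center, 1.0) for _ in range(dim)])
        labels.append(y)
    return features, labels


def evaluate(separation, seed=0):
    """Train a probe on one synthetic split and score it on another."""
    rng = random.Random(seed)
    train_x, train_y = make_synthetic_features(rng, 300, 4, separation)
    test_x, test_y = make_synthetic_features(rng, 300, 4, separation)
    weights, bias = train_linear_probe(train_x, train_y)
    return probe_accuracy(weights, bias, test_x, test_y)


# Features whose classes sit far apart should be easier to probe linearly.
well_separated_acc = evaluate(separation=2.0)
overlapping_acc = evaluate(separation=0.1)
print(f"well separated: {well_separated_acc:.3f}, overlapping: {overlapping_acc:.3f}")
```

In practice the synthetic clusters would be replaced by features extracted from the model before and after tuning, and a higher probe accuracy would indicate more linearly separable features.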