Semantic Generative Tuning for Unified Multimodal Models

arXiv CS.AI

이 뉴스, 어떠셨어요?

한 번의 탭으로 반응을 남겨요 · 로그인 불필요

CC BY

이 매체는 공공·자유 라이선스로 본문을 직접 표시합니다.

Abstract

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture.

However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives.

Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement.

This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs.

Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies.

Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity.

Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities.

Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation pattern.

Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks.

Our code is available on the this https URL.

전문 보기

Semantic Generative Tuning for Unified Multimodal Models

이 뉴스, 어떠셨어요?

Abstract

관련 뉴스

'research' 카테고리 뉴스

Detecting and Controlling Sycophancy with Cascading Linear Features

Life After Benchmark Saturation: A Case Study of CORE-Bench

Refusal Lives Downstream of Persona in Chat Models

arXiv의 다른 기사

Knowledge-augmented Agentic AI for Mental Health Medication Information Seeking

Accelerating Skill Assessment in Chess: A Drift-Diffusion-Enhanced Elo Rating System

Governing Actions, Not Agents: Institutional Attestation as a Governance Model for Autonomous AI Systems