Reconstruction Alignment Improves Unified Multimodal Models
Abstract
Unified multimodal models (UMMs) combine visual understanding and generation within a single architecture.
However, conventional training relies on image-text pairs (or sequences) whose captions are typically sparse and miss fine-grained visual details, even when they use hundreds of words to describe a simple image.
We introduce Reconstruction Alignment (RECA), a resource-efficient post-training method that leverages visual understanding encoder embeddings as dense "text prompts", providing rich supervision without captions.
Concretely, RECA conditions a UMM on its own visual understanding embeddings and optimizes it to reconstruct the input image with a self-supervised reconstruction loss, thereby realigning understanding and generation.
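The reconstruction objective described above can be illustrated with a deliberately tiny toy, not the authors' implementation. It assumes a frozen linear "understanding encoder", a trainable linear "generator" that sees only the encoder's embedding, and a mean-squared-error reconstruction loss; the names `reca_step`, `train_toy`, `FROZEN_ENC`, and `IMAGES` are hypothetical, and a real UMM would use its own generation loss and architecture.

```python
# Toy sketch of reconstruction alignment (assumptions: linear maps, MSE loss).
# "Images" are 2-D vectors; the encoder is frozen, only the generator is trained.

FROZEN_ENC = [[1.0, 0.5], [-0.5, 1.0]]          # hypothetical frozen encoder weights
IMAGES = [[1.0, 0.0], [0.0, 1.0], [0.5, -0.5]]  # hypothetical training "images"


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def reca_step(gen_w, image, lr):
    """One gradient step on the generator; returns the loss before the update."""
    emb = matvec(FROZEN_ENC, image)   # dense "prompt": the understanding embedding
    recon = matvec(gen_w, emb)        # generator is conditioned on the embedding only
    err = [r - x for r, x in zip(recon, image)]
    n = len(image)
    # d(MSE)/d(gen_w[i][j]) = (2 / n) * err[i] * emb[j]; the encoder gets no update.
    for i in range(n):
        for j in range(len(emb)):
            gen_w[i][j] -= lr * (2.0 / n) * err[i] * emb[j]
    return sum(e * e for e in err) / n


def train_toy(epochs=200, lr=0.1):
    """Train the generator to reconstruct each image from its own embedding."""
    gen_w = [[0.0, 0.0], [0.0, 0.0]]
    losses = []
    for _ in range(epochs):
        epoch_loss = sum(reca_step(gen_w, img, lr) for img in IMAGES) / len(IMAGES)
        losses.append(epoch_loss)
    return gen_w, losses


if __name__ == "__main__":
    weights, history = train_toy()
    print("first epoch loss:", history[0])
    print("last epoch loss:", history[-1])
```

The point of the toy is the data flow: no caption appears anywhere, and the supervision signal is the input image itself, reached through the model's own understanding embedding.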
Despite its simplicity, RECA is broadly applicable: across autoregressive, masked-autoregressive, and diffusion-based UMMs, it consistently improves generation and editing fidelity.
With only 27 GPU hours, post-training with RECA substantially improves image generation performance on GenEval (0.73 $\rightarrow$ 0.90) and DPGBench (80.93 $\rightarrow$ 88.15), while also boosting editing benchmarks (ImgEdit 3.38 $\rightarrow$ 3.75, GEdit 6.94 $\rightarrow$ 7.27).
Notably, RECA surpasses much larger open-source models, establishing it as an efficient and general post-training alignment strategy for UMMs.
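To put the reported benchmark numbers side by side, the short script below computes absolute and relative gains from the before/after scores quoted in the abstract. The helper name `gains` is hypothetical, and because the benchmarks use different score scales, the relative figures are only a rough guide rather than a like-for-like comparison.

```python
# Before/after scores exactly as quoted in the abstract.
REPORTED = {
    "GenEval": (0.73, 0.90),
    "DPGBench": (80.93, 88.15),
    "ImgEdit": (3.38, 3.75),
    "GEdit": (6.94, 7.27),
}


def gains(before, after):
    """Return (absolute gain, relative gain in percent of the starting score)."""
    delta = after - before
    return delta, 100.0 * delta / before


if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        delta, pct = gains(before, after)
        print(f"{name}: +{delta:.2f} absolute, {pct:.1f}% relative")
```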