Rethinking Garment Conditioning in Diffusion-based Virtual Try-On: Decouple, Don't Denoise
Abstract
Virtual Try-On (VTON) synthesizes realistic images of a person wearing a target garment, with broad applications in e-commerce and fashion.
Diffusion-based dual-UNet methods achieve strong results but double the parameters by dedicating a separate network to garment conditioning.
Spatial concatenation offers a simpler single-network alternative, yet both UNet- and DiT-based instantiations report that full fine-tuning is ineffective, and the community has settled for attention-only training.
We ask: why does full fine-tuning fail, and can this be resolved?
Through what is, to our knowledge, the first visualization study of dual-UNet reference network behavior, we identify a unifying insight: garment conditioning must be decoupled from the denoising process.
Spatial concatenation violates this by embedding the garment within the denoising target, causing three conflicts: guidance leakage, gradient competition, and train-test discrepancy.
We derive three design principles that restore this decoupling and implement them purely as a recipe on top of a standard architecture, with no architectural modification.
The resulting model, DeCo-VTON (860M params), achieves single-network state of the art, matching the dual-UNet state of the art at half the cost while being preferred in human evaluation.
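To make the spatial-concatenation setup concrete, the toy sketch below places a garment latent and a person latent side by side on one canvas and shows that, when the denoising target spans the whole canvas, half of the loss terms come from the garment region. The person-only loss mask is an illustrative assumption for contrast and is not the recipe of DeCo-VTON, whose three principles the abstract does not spell out. All names (`spatial_concat`, `mse`, `person_mask`) and the all-zero "prediction" are hypothetical.

```python
import random


def spatial_concat(garment, person):
    """Place two H x W latents side by side along the width axis."""
    if len(garment) != len(person):
        raise ValueError("garment and person latents must share the same height")
    return [g_row + p_row for g_row, p_row in zip(garment, person)]


def mse(pred, target, mask=None):
    """Mean squared error over a 2D grid, optionally restricted to cells where mask is 1."""
    total, count = 0.0, 0
    for i, row in enumerate(pred):
        for j, value in enumerate(row):
            if mask is None or mask[i][j]:
                total += (value - target[i][j]) ** 2
                count += 1
    return total / count


H, W = 4, 4
rng = random.Random(0)
garment = [[rng.gauss(0, 1) for _ in range(W)] for _ in range(H)]
person = [[rng.gauss(0, 1) for _ in range(W)] for _ in range(H)]

# Single-network input: garment on the left half, person on the right half.
canvas = spatial_concat(garment, person)

# In the naive setup the noise target covers the whole canvas, garment included.
noise = [[rng.gauss(0, 1) for _ in range(2 * W)] for _ in range(H)]

# Stand-in for a model output (assumption: all zeros, for illustration only).
pred = [[0.0] * (2 * W) for _ in range(H)]

# Share of loss terms that belong to the garment region under the naive loss.
garment_fraction = W / (2 * W)

# Illustrative alternative (an assumption, not the paper's method):
# count only the person half in the loss.
person_mask = [[0] * W + [1] * W for _ in range(H)]

full_loss = mse(pred, noise)
person_only_loss = mse(pred, noise, person_mask)

print(f"garment share of loss terms: {garment_fraction:.2f}")
print(f"full-canvas loss: {full_loss:.4f}, person-only loss: {person_only_loss:.4f}")
```

With equal-sized halves, the garment region accounts for half of the terms in the full-canvas loss, which is one way to picture how the conditioning signal ends up inside the denoising target.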