Text Over Image: Auditing Multimodal Robustness in Synthetic Medical Image Detection
Abstract
With the rapid adoption of generative AI, synthetic medical images pose growing risks, including diagnostic deception and insurance fraud.
Although prior work has explored vision-language model (VLM)-based synthetic image detection, these evaluations typically consider images in isolation.
In clinical practice, however, images are interpreted alongside structured records and metadata, and VLMs are increasingly deployed under joint image-record inputs.
We uncover a previously underexamined multimodal vulnerability: when given both modalities, VLMs may overweight record context in authenticity judgments, such that the same image receives different predictions solely due to changes in its accompanying text.
This raises concerns about robustness in real-world deployment.
To systematically characterize this effect, we reformulate synthetic medical image detection as an audit of multimodal robustness at the image-record interface and introduce a paired benchmark that holds the image fixed while swapping controlled metadata variants.
Across multiple imaging modalities, we evaluate diverse open-weight and frontier API VLMs and find that changing the metadata context alone can flip authenticity judgments, with accuracy on authentic images dropping by 61.1% on average under an explicit AI-origin tag.
We further propose an inference-time mitigation pipeline that detects and neutralizes provenance shortcuts without model retraining, substantially outperforming direct prompt-based suppression on the affected subset.
Our benchmark provides a standardized tool for assessing and improving multimodal robustness beyond image-only settings.
Code and data will be released upon acceptance.
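As an illustration of the paired audit described above, the following minimal sketch computes how often a model's authenticity label changes when only the metadata variant changes for a fixed image. It is not the authors' released code: the function name `flip_rate`, the variant names `neutral` and `ai_tag`, the label strings, and the toy records are all hypothetical, and the choice of a "neutral" baseline variant is an assumption made for this example.

```python
from collections import defaultdict


def flip_rate(predictions, baseline="neutral"):
    """Fraction of images whose label differs from the baseline variant.

    predictions: iterable of (image_id, variant, label) tuples, where the
    image is identical across variants and only the metadata text differs.
    Returns a dict mapping each non-baseline variant to its flip rate.
    """
    # Group predicted labels by image so each image is compared with itself.
    by_image = defaultdict(dict)
    for image_id, variant, label in predictions:
        by_image[image_id][variant] = label

    flips = defaultdict(int)
    totals = defaultdict(int)
    for variants in by_image.values():
        if baseline not in variants:
            continue  # no reference prediction for this image; skip it
        base_label = variants[baseline]
        for variant, label in variants.items():
            if variant == baseline:
                continue
            totals[variant] += 1
            flips[variant] += int(label != base_label)

    return {variant: flips[variant] / totals[variant] for variant in totals}


# Hypothetical toy records, not results from the paper.
toy = [
    ("img1", "neutral", "authentic"),
    ("img1", "ai_tag", "synthetic"),
    ("img2", "neutral", "authentic"),
    ("img2", "ai_tag", "authentic"),
]
rates = flip_rate(toy)
print(rates)
```

Because the image is held fixed within each pair, any nonzero flip rate in this sketch is attributable to the metadata text rather than to visual content, which is the property the paired design is meant to isolate.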