Why Can't I Open My Drawer? Mitigating Object-Driven Shortcuts in Zero-Shot Compositional Action Recognition
이 뉴스, 어떠셨어요?
한 번의 탭으로 반응을 남겨요 · 로그인 불필요
Abstract
Zero-Shot Compositional Action Recognition (ZS-CAR) requires recognizing novel verb-object combinations composed of previously observed primitives.
In this work, we tackle a key failure mode: models predict verbs via object-driven shortcuts (i.e., relying on the labeled object class) rather than temporal evidence.
We argue that sparse compositional supervision and verb-object learning asymmetry can promote object-driven shortcut learning.
Our analysis with proposed diagnostic metrics shows that existing methods overfit to training co-occurrence patterns and underuse temporal verb cues, resulting in weak generalization to unseen compositions.
To address object-driven shortcuts, we propose Robust COmpositional REpresentations (RCORE) with two components.
Co-occurrence Prior Regularization (CPR) adds explicit supervision for unseen compositions and regularizes the model against frequent co-occurrence priors by treating them as hard negatives.
Temporal Order Regularization for Composition (TORC) enforces temporal-order sensitivity to learn temporally grounded verb representations.
Across Sth-com and EK100-com, RCORE reduces shortcut diagnostics and consequently improves compositional generalization.