PruneGround: Plug-and-play Spatial Pruning for 3D Visual Grounding
Abstract
3D Visual Grounding (3DVG) aims to localize target objects in 3D scenes given natural language descriptions.
Existing approaches typically perform reasoning over the entire scene, leading to ambiguous predictions and high computational cost, especially in cluttered environments.
We observe that many referential expressions rely on local spatial context and often correspond to restricted spatial regions rather than the full scene.
Motivated by this insight, we propose PruneGround, an effective plug-and-play framework for 3DVG built upon three key components.
First, we introduce Language-Guided Spatial Pruning (LGSP), which leverages a frozen Vision Language Model (VLM) to identify language-relevant regions, thereby reducing spatial computation and restricting grounding candidates to a narrower search space.
Second, we propose MultiView-Conditioned Description Reformulation (MCDR), which decomposes complex expressions into simplified target-anchor relations and augments missing spatial cues through multi-view reasoning.
Finally, we propose LLM-Grounder, which repurposes a detection-pretrained spatial LLM into a language-conditioned grounding model by aligning point cloud and linguistic representations within the pruned region.
Extensive experiments on the three most popular point cloud benchmarks (ScanRefer, Nr3D, and Sr3D) demonstrate that our method achieves state-of-the-art results on all three ScanRefer settings and on 9 out of 10 Nr3D/Sr3D settings.
Code and models are publicly available: this https URL
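
The pruning idea behind LGSP can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the language-relevant region has already been obtained (for example from a VLM) and is represented as an axis-aligned 3D box, which the abstract does not specify. The names `Box` and `prune_scene` are hypothetical. The sketch only shows how restricting the scene to such a region shrinks both the point set and the list of grounding candidates.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


@dataclass
class Box:
    """Axis-aligned 3D box given by min and max corners (hypothetical type)."""
    lo: Point
    hi: Point

    def contains(self, p: Sequence[float]) -> bool:
        # A point is inside if every coordinate lies within the box bounds.
        return all(l <= c <= h for l, c, h in zip(self.lo, p, self.hi))

    def center(self) -> Point:
        return tuple((l + h) / 2.0 for l, h in zip(self.lo, self.hi))


def prune_scene(
    points: List[Point], candidates: List[Box], region: Box
) -> Tuple[List[Point], List[Box]]:
    """Keep only points and candidate boxes that fall inside `region`.

    Assumption: a candidate is kept when its center lies in the region;
    the actual selection rule used by the paper may differ.
    """
    kept_points = [p for p in points if region.contains(p)]
    kept_candidates = [b for b in candidates if region.contains(b.center())]
    return kept_points, kept_candidates


# Toy scene: two points and one candidate sit inside the region.
region = Box((0.0, 0.0, 0.0), (2.0, 2.0, 2.0))
points = [(0.5, 0.5, 0.5), (1.5, 1.0, 0.2), (5.0, 5.0, 1.0)]
candidates = [
    Box((0.2, 0.2, 0.0), (0.8, 0.8, 1.0)),
    Box((4.5, 4.5, 0.0), (5.5, 5.5, 2.0)),
]
kept_points, kept_candidates = prune_scene(points, candidates, region)
```

In this toy scene the distant point and the distant candidate are discarded, so any downstream grounding model would reason over a smaller input and fewer candidates.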