Paper Title
Multi-View Transformer for 3D Visual Grounding
Paper Authors
Paper Abstract
The 3D visual grounding task aims to ground a natural language description to the targeted object in a 3D scene, which is usually represented as a 3D point cloud. Previous works studied visual grounding under specific views. The vision-language correspondence learned in this way can easily fail once the view changes. In this paper, we propose a Multi-View Transformer (MVT) for 3D visual grounding. We project the 3D scene to a multi-view space, in which the position information of the 3D scene under different views is modeled simultaneously and aggregated together. The multi-view space enables the network to learn a more robust multi-modal representation for 3D visual grounding and eliminates the dependence on specific views. Extensive experiments show that our approach significantly outperforms all state-of-the-art methods. Specifically, on the Nr3D and Sr3D datasets, our method outperforms the best competitor by 11.2% and 7.1%, respectively, and even surpasses recent work with extra 2D assistance by 5.9% and 6.6%. Our code is available at https://github.com/sega-hsj/MVT-3DVG.
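The sketch below illustrates the multi-view idea described in the abstract, not the authors' exact implementation: the scene is "projected" into several views by rotating object coordinates around the vertical axis, per-view positions are computed, and the results are aggregated so the fused representation no longer depends on any single view. The helper names (rotate_z, multi_view_positions, aggregate_views), the choice of num_views, and the use of plain averaging are assumptions made for illustration.

```python
# Minimal sketch of the multi-view projection and aggregation idea
# (hypothetical helpers; not the released MVT-3DVG code).
import numpy as np

def rotate_z(points, theta):
    """Rotate an (M, 3) array of points by angle theta around the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    return points @ rot.T

def multi_view_positions(object_centers, num_views=4):
    """Return per-view coordinates of shape (num_views, M, 3).

    `object_centers` is an (M, 3) array of object center positions; in the
    full model, each view's coordinates would feed a learned position
    encoder before fusion with the language features.
    """
    angles = [2.0 * np.pi * i / num_views for i in range(num_views)]
    return np.stack([rotate_z(object_centers, a) for a in angles])

def aggregate_views(per_view_features):
    """View-agnostic aggregation: average features over the view axis."""
    return per_view_features.mean(axis=0)

if __name__ == "__main__":
    centers = np.random.rand(8, 3)            # 8 hypothetical object centers
    views = multi_view_positions(centers, 4)  # (4, 8, 3): one copy per view
    fused = aggregate_views(views)            # (8, 3): view-independent result
    print(views.shape, fused.shape)
```

Averaging over views is used here only to make the view-independence concrete; the actual model aggregates learned multi-modal features rather than raw coordinates.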