Paper Title

ScanEnts3D: Exploiting Phrase-to-3D-Object Correspondences for Improved Visio-Linguistic Models in 3D Scenes

Paper Authors

Abdelreheem, Ahmed, Olszewski, Kyle, Lee, Hsin-Ying, Wonka, Peter, Achlioptas, Panos

Abstract

The two popular datasets ScanRefer [16] and ReferIt3D [3] connect natural language to real-world 3D data. In this paper, we curate a large-scale and complementary dataset extending both the aforementioned ones by associating all objects mentioned in a referential sentence to their underlying instances inside a 3D scene. Specifically, our Scan Entities in 3D (ScanEnts3D) dataset provides explicit correspondences between 369k objects across 84k natural referential sentences, covering 705 real-world scenes. Crucially, we show that by incorporating intuitive losses that enable learning from this novel dataset, we can significantly improve the performance of several recently introduced neural listening architectures, including improving the SoTA in both the Nr3D and ScanRefer benchmarks by 4.3% and 5.0%, respectively. Moreover, we experiment with competitive baselines and recent methods for the task of language generation and show that, as with neural listeners, 3D neural speakers can also noticeably benefit by training with ScanEnts3D, including improving the SoTA by 13.2 CIDEr points on the Nr3D benchmark. Overall, our carefully conducted experimental studies strongly support the conclusion that, by learning on ScanEnts3D, commonly used visio-linguistic 3D architectures can become more efficient and interpretable in their generalization without needing to provide these newly collected annotations at test time. The project's webpage is https://scanents3d.github.io/ .
