Paper Title
Open-Vocabulary 3D Detection via Image-level Class and Debiased Cross-modal Contrastive Learning
Paper Authors

Paper Abstract
Current point-cloud detection methods struggle to detect open-vocabulary objects in the real world due to their limited generalization capability. Moreover, collecting and fully annotating a point-cloud detection dataset with numerous object classes is extremely laborious and expensive, which limits the class coverage of existing point-cloud datasets and hinders models from learning general representations for open-vocabulary point-cloud detection. To the best of our knowledge, we are the first to study the problem of open-vocabulary 3D point-cloud detection. Instead of seeking a fully labeled point-cloud dataset, we resort to ImageNet1K to broaden the vocabulary of the point-cloud detector. We propose OV-3DETIC, an Open-Vocabulary 3D DETector using Image-level Class supervision. Specifically, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo labels for unseen classes. We then propose a novel debiased cross-modal contrastive learning method to transfer knowledge from the image modality to the point-cloud modality during training. Without hurting latency during inference, OV-3DETIC makes the point-cloud detector capable of open-vocabulary detection. Extensive experiments demonstrate that OV-3DETIC achieves at least a 10.77% mAP improvement (absolute value) over a wide range of baselines on the SUN-RGBD dataset and a 9.56% mAP improvement (absolute value) on the ScanNet dataset. In addition, we conduct extensive experiments to shed light on why the proposed OV-3DETIC works.
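To make the cross-modal idea concrete, the sketch below shows a standard symmetric InfoNCE contrastive loss between paired image and point-cloud embeddings of the same objects. This is only an illustrative baseline formulation, not the paper's debiased variant; the function name, temperature value, and numpy implementation are assumptions for illustration.

```python
import numpy as np

def cross_modal_infonce(img_feats, pc_feats, temperature=0.1):
    """Symmetric InfoNCE loss between image and point-cloud embeddings.

    Rows of img_feats and pc_feats are assumed to be paired: row i of each
    matrix embeds the same object in the two modalities. Pulling matched
    pairs together (diagonal) and pushing mismatched pairs apart transfers
    knowledge across modalities. This is a generic formulation, not the
    paper's debiased loss.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    pc = pc_feats / np.linalg.norm(pc_feats, axis=1, keepdims=True)
    logits = img @ pc.T / temperature       # (N, N); positives on diagonal
    idx = np.arange(len(logits))

    def cross_entropy(l):
        # Log-softmax over each row, numerically stabilized.
        l = l - l.max(axis=1, keepdims=True)
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_prob[idx, idx].mean()   # negative log-likelihood of pairs

    # Average the image->point-cloud and point-cloud->image directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With correctly paired features the loss is low; shuffling one modality's rows (breaking the pairing) drives it up, which is the signal that aligns the two embedding spaces during training.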