Paper Title
Towards 3D Scene Understanding by Referring Synthetic Models
Authors
Abstract
Promising performance has been achieved for visual perception on point clouds. However, current methods typically rely on labour-intensive annotations of scene scans. In this paper, we explore how synthetic models can alleviate the real-scene annotation burden, i.e., taking labelled 3D synthetic models as the reference for supervision, a neural network learns to recognize specific categories of objects in a real scene scan without any scene annotation. The problem studies how to transfer knowledge from synthetic 3D models to real 3D scenes and is named Referring Transfer Learning (RTL). The main challenge lies in bridging the model-to-scene gap (from a single model to a whole scene) and the synthetic-to-real gap (from a synthetic model to a real scene's object). To this end, we propose a simple yet effective framework that performs two alignment operations. First, physical data alignment makes the synthetic models cover the diversity of the scene's objects through data processing techniques. Then a novel \textbf{convex-hull regularized feature alignment} introduces learnable prototypes to project the point features of both synthetic models and real scenes into a unified feature space, which alleviates the domain gap. These operations ease the model-to-scene and synthetic-to-real difficulty, enabling a network to recognize the target objects in a real, unseen scene. Experiments show that our method achieves average mAP scores of 46.08\% and 55.49\% on the ScanNet and S3DIS datasets, respectively, by learning from synthetic models in the ModelNet dataset. Code will be publicly available.
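To make the convex-hull regularized feature alignment concrete, below is a minimal PyTorch sketch assuming the alignment amounts to re-expressing each point feature as a convex combination of learnable prototypes (softmax weights over prototype similarities). The class name, feature dimension, and prototype count are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvexHullFeatureAlignment(nn.Module):
    """Illustrative sketch (not the authors' code): project point features
    onto the convex hull spanned by learnable prototypes, so features from
    synthetic models and real scans live in one shared space."""

    def __init__(self, feat_dim: int = 128, num_prototypes: int = 16):
        super().__init__()
        # K learnable prototype vectors spanning the shared feature space.
        self.prototypes = nn.Parameter(torch.randn(num_prototypes, feat_dim))

    def forward(self, point_feats: torch.Tensor) -> torch.Tensor:
        # point_feats: (N, D) features from either the synthetic or real domain.
        # Similarity of each point feature to each prototype.
        logits = point_feats @ self.prototypes.t()      # (N, K)
        # Softmax gives non-negative weights summing to 1, i.e. convex
        # combination coefficients over the prototypes.
        weights = F.softmax(logits, dim=-1)             # (N, K)
        # Re-express each feature inside the prototypes' convex hull.
        return weights @ self.prototypes                # (N, D)


# Usage: features extracted from a synthetic ModelNet object and a ScanNet
# scan are mapped into the same prototype-spanned space before matching.
align = ConvexHullFeatureAlignment(feat_dim=128, num_prototypes=16)
synthetic_feats = torch.randn(1024, 128)
scene_feats = torch.randn(4096, 128)
synthetic_aligned = align(synthetic_feats)
scene_aligned = align(scene_feats)
```

Constraining both domains to convex combinations of the same prototypes regularizes the feature space so that matching a single synthetic model against a cluttered real scan is less affected by the domain gap.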