Paper Title

MM-3DScene: 3D Scene Understanding by Customizing Masked Modeling with Informative-Preserved Reconstruction and Self-Distilled Consistency

Paper Authors

Mingye Xu, Mutian Xu, Tong He, Wanli Ouyang, Yali Wang, Xiaoguang Han, Yu Qiao

Paper Abstract

Masked Modeling (MM) has demonstrated widespread success in various vision challenges, by reconstructing masked visual patches. Yet, applying MM for large-scale 3D scenes remains an open problem due to the data sparsity and scene complexity. The conventional random masking paradigm used in 2D images often causes a high risk of ambiguity when recovering the masked region of 3D scenes. To this end, we propose a novel informative-preserved reconstruction, which explores local statistics to discover and preserve the representative structured points, effectively enhancing the pretext masking task for 3D scene understanding. Integrated with a progressive reconstruction manner, our method can concentrate on modeling regional geometry and enjoy less ambiguity for masked reconstruction. Besides, such scenes with progressive masking ratios can also serve to self-distill their intrinsic spatial consistency, requiring to learn the consistent representations from unmasked areas. By elegantly combining informative-preserved reconstruction on masked areas and consistency self-distillation from unmasked areas, a unified framework called MM-3DScene is yielded. We conduct comprehensive experiments on a host of downstream tasks. The consistent improvement (e.g., +6.1 mAP@0.5 on object detection and +2.2% mIoU on semantic segmentation) demonstrates the superiority of our approach.
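To make the masking strategy described in the abstract more concrete, below is a minimal sketch (not the authors' released code) of informative-preserved masking: each point is scored by a local-statistics proxy for informativeness (surface variation from the covariance of its k nearest neighbors), the highest-scoring structured points are always preserved, and masking is applied only to the remaining points at a progressive ratio. The function names, the `keep_ratio` parameter, and the choice of surface variation as the local statistic are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of informative-preserved masking under the assumptions above;
# not the authors' implementation.
import numpy as np
from scipy.spatial import cKDTree


def surface_variation(points: np.ndarray, k: int = 16) -> np.ndarray:
    """Local-statistics score per point: lambda_min / (lambda_1 + lambda_2 + lambda_3)
    of the covariance of its k nearest neighbors (a common curvature proxy)."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k)                      # (N, k) neighbor indices
    neigh = points[idx]                                   # (N, k, 3)
    centered = neigh - neigh.mean(axis=1, keepdims=True)
    cov = np.einsum("nki,nkj->nij", centered, centered) / k
    eigvals = np.linalg.eigvalsh(cov)                     # ascending eigenvalues
    return eigvals[:, 0] / (eigvals.sum(axis=1) + 1e-12)


def informative_preserved_mask(points, mask_ratio, keep_ratio=0.2, k=16, rng=None):
    """Boolean mask (True = masked) that never masks the most informative points."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    score = surface_variation(points, k)
    n_keep = max(1, int(keep_ratio * n))
    keep = np.argsort(score)[-n_keep:]                    # preserved structured points
    candidates = np.setdiff1d(np.arange(n), keep)
    n_mask = min(int(mask_ratio * n), len(candidates))
    masked = rng.choice(candidates, size=n_mask, replace=False)
    mask = np.zeros(n, dtype=bool)
    mask[masked] = True
    return mask


# Progressive masking ratios over the same scene: masked points feed the
# reconstruction objective, while the shared unmasked areas can be used for
# consistency self-distillation across the differently masked views.
pts = np.random.rand(2048, 3).astype(np.float32)
masks = [informative_preserved_mask(pts, r) for r in (0.3, 0.5, 0.7)]
```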
