Paper Title

Graph-DETR3D: Rethinking Overlapping Regions for Multi-View 3D Object Detection

Authors

Zehui Chen, Zhenyu Li, Shiquan Zhang, Liangji Fang, Qinhong Jiang, Feng Zhao

Abstract

3D object detection from multiple image views is a fundamental and challenging task for visual scene understanding. Owing to its low cost and high efficiency, multi-view 3D object detection has demonstrated promising application prospects. However, accurately detecting objects through perspective views in 3D space is extremely difficult due to the lack of depth information. Recently, DETR3D introduced a novel 3D-2D query paradigm for aggregating multi-view images for 3D object detection and achieved state-of-the-art performance. In this paper, through intensive pilot experiments, we quantify the objects located in different regions and find that "truncated instances" (i.e., objects at the border regions of each image) are the main bottleneck hindering the performance of DETR3D. Although it merges multiple features from two adjacent views in the overlapping regions, DETR3D still suffers from insufficient feature aggregation and thus misses the chance to fully boost the detection performance. To tackle this problem, we propose Graph-DETR3D, which automatically aggregates multi-view imagery information through graph structure learning (GSL). It constructs a dynamic 3D graph between each object query and the 2D feature maps to enhance the object representations, especially at the border regions. Besides, Graph-DETR3D benefits from a novel depth-invariant multi-scale training strategy, which maintains visual depth consistency by simultaneously scaling the image size and the object depth. Extensive experiments on the nuScenes dataset demonstrate the effectiveness and efficiency of Graph-DETR3D. Notably, our best model achieves 49.5 NDS on the nuScenes test leaderboard, setting a new state of the art among published image-view 3D object detectors.

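The dynamic 3D graph described in the abstract can be pictured as a small set of learnable 3D sampling points attached to each object query, whose projections gather 2D features from every camera view. Below is a minimal PyTorch sketch of that idea, not the authors' released implementation: the class name, tensor layouts, the query-conditioned offset head, and the plain mean fusion at the end are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicGraphAggregation(nn.Module):
    """For each object query, build a small graph of 3D points around its
    reference point, project the nodes into every camera view, sample 2D
    features there, and fuse them back into the query embedding."""

    def __init__(self, embed_dim=256, num_nodes=8):
        super().__init__()
        self.num_nodes = num_nodes
        # Query-conditioned 3D offsets play the role of learnable graph edges.
        self.offset_head = nn.Linear(embed_dim, num_nodes * 3)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, queries, ref_points, feat_map, lidar2img):
        """
        queries:    (B, Q, C)       object query embeddings
        ref_points: (B, Q, 3)       3D reference points in ego coordinates
        feat_map:   (B, N, C, H, W) image features from N camera views
        lidar2img:  (B, N, 4, 4)    3D-to-image projection matrix per view
        """
        B, Q, C = queries.shape
        K = self.num_nodes

        # 1) Graph nodes: the reference point plus query-conditioned offsets.
        offsets = self.offset_head(queries).view(B, Q, K, 3)
        nodes = ref_points.unsqueeze(2) + offsets                # (B, Q, K, 3)

        # 2) Project every node into every camera (homogeneous coordinates).
        ones = torch.ones_like(nodes[..., :1])
        pts = torch.cat([nodes, ones], dim=-1).view(B, 1, Q * K, 4, 1)
        cam = (lidar2img.unsqueeze(2) @ pts).squeeze(-1)         # (B, N, Q*K, 4)
        uv = cam[..., :2] / cam[..., 2:3].clamp(min=1e-5)        # pixel coords

        # 3) Normalize to [-1, 1] and bilinearly sample the feature maps.
        #    (Masking of nodes that fall outside an image is omitted here.)
        H, W = feat_map.shape[-2:]
        grid = torch.stack([uv[..., 0] / W * 2 - 1,
                            uv[..., 1] / H * 2 - 1], dim=-1)     # (B, N, Q*K, 2)
        feats = F.grid_sample(feat_map.flatten(0, 1),
                              grid.flatten(0, 1).unsqueeze(2),
                              align_corners=False)               # (B*N, C, Q*K, 1)
        feats = feats.view(B, -1, C, Q, K)

        # 4) Fuse across views and graph nodes (a plain mean here; the paper
        #    learns the aggregation weights) and update the queries.
        fused = feats.mean(dim=(1, 4)).permute(0, 2, 1)          # (B, Q, C)
        return queries + self.out_proj(fused)
```

Because nodes near image borders project into two adjacent views at once, a query for a truncated instance gathers features from both of them, which is the effect the abstract credits for the gains in overlapping regions.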
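
The depth-invariant multi-scale training can likewise be illustrated with a small, hedged sketch: when a view is resized by a factor s (with the camera intrinsics rescaled by the same factor, as is standard for resize augmentation), the 3D box centers are divided by s so that the labelled depth stays consistent with the new apparent object size. The function name, coordinate layout, and the choice to rescale the full 3D center are assumptions for illustration, not the paper's exact transform.

```python
import random

import torch
import torch.nn.functional as F


def depth_invariant_resize(image, centers_cam, scale_range=(0.8, 1.2)):
    """
    image:       (C, H, W) tensor for a single camera view
    centers_cam: (M, 3) object centers (x, y, z) in camera coordinates,
                 with z the depth along the optical axis (assumed layout)
    """
    s = random.uniform(*scale_range)

    # Resize the image by the sampled factor; the intrinsics are assumed to
    # be rescaled by the same factor elsewhere in the data pipeline.
    new_hw = (int(round(image.shape[1] * s)), int(round(image.shape[2] * s)))
    image = F.interpolate(image.unsqueeze(0), size=new_hw,
                          mode='bilinear', align_corners=False).squeeze(0)

    # An object that appears s times larger reads as s times closer. Dividing
    # the whole camera-frame center by s keeps its projection aligned with the
    # object in the resized image while changing the depth in step with the
    # new apparent size, so the size-to-depth cue stays coherent.
    return image, centers_cam / s
```

Scaling only the image (as in ordinary multi-scale training) would break this cue, since the object would look larger or smaller while its annotated depth stayed fixed; rescaling the centers restores the consistency the abstract refers to.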