Paper Title
3D-CVF: Generating Joint Camera and LiDAR Features Using Cross-View Spatial Feature Fusion for 3D Object Detection
Paper Authors
Paper Abstract
In this paper, we propose a new deep architecture for fusing camera and LiDAR sensors for 3D object detection. Because the camera and LiDAR sensor signals have different characteristics and distributions, fusing these two modalities is expected to improve both the accuracy and robustness of 3D object detection. One of the challenges presented by camera-LiDAR fusion is that the spatial feature maps obtained from each modality are represented in significantly different views, namely the camera and world coordinates; hence, combining the two heterogeneous feature maps without loss of information is not an easy task. To address this problem, we propose a method called 3D-CVF that combines the camera and LiDAR features using a cross-view spatial feature fusion strategy. First, the method employs an auto-calibrated projection to transform the 2D camera features into a smooth spatial feature map with the highest correspondence to the LiDAR features in the bird's eye view (BEV) domain. Then, a gated feature fusion network uses spatial attention maps to mix the camera and LiDAR features appropriately according to the region. Next, camera-LiDAR feature fusion is also performed in the subsequent proposal refinement stage, where camera features are pooled from the 2D camera-view domain via 3D RoI grid pooling and fused with the BEV features for proposal refinement. Our evaluations, conducted on the KITTI and nuScenes 3D object detection datasets, demonstrate that camera-LiDAR fusion offers a significant performance gain over a single modality and that the proposed 3D-CVF achieves state-of-the-art performance on the KITTI benchmark.
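To make the gated fusion step in the abstract concrete, below is a minimal PyTorch sketch of how camera and LiDAR BEV feature maps could be mixed with learned spatial attention gates. The module name, channel sizes, and the exact gating form are illustrative assumptions and not the authors' implementation.

```python
# Illustrative sketch (not the authors' code): gated camera-LiDAR fusion in the
# BEV domain. Each modality is weighted by a per-location attention (gate) map
# computed from the concatenated features, then the gated maps are concatenated.
import torch
import torch.nn as nn


class GatedCameraLiDARFusion(nn.Module):
    """Mixes camera and LiDAR BEV feature maps with spatial attention gates."""

    def __init__(self, cam_channels: int = 128, lidar_channels: int = 128):
        super().__init__()
        in_channels = cam_channels + lidar_channels
        # A 3x3 conv per modality produces a single-channel gate map in [0, 1].
        self.cam_gate = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size=3, padding=1), nn.Sigmoid()
        )
        self.lidar_gate = nn.Sequential(
            nn.Conv2d(in_channels, 1, kernel_size=3, padding=1), nn.Sigmoid()
        )

    def forward(self, cam_bev: torch.Tensor, lidar_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev:   camera features already projected into BEV, (B, C_cam, H, W)
        # lidar_bev: LiDAR BEV features,                         (B, C_lidar, H, W)
        joint = torch.cat([cam_bev, lidar_bev], dim=1)
        # The spatial gates decide, per BEV location, how much each modality
        # contributes to the fused feature.
        fused_cam = cam_bev * self.cam_gate(joint)
        fused_lidar = lidar_bev * self.lidar_gate(joint)
        return torch.cat([fused_cam, fused_lidar], dim=1)


if __name__ == "__main__":
    fusion = GatedCameraLiDARFusion()
    cam = torch.randn(2, 128, 200, 176)   # toy BEV grid size
    lidar = torch.randn(2, 128, 200, 176)
    print(fusion(cam, lidar).shape)        # torch.Size([2, 256, 200, 176])
```

The key design point conveyed by the abstract is that the gating is spatial: regions where one sensor is more informative (e.g., distant or sparse LiDAR returns) can lean more heavily on the other modality rather than mixing the two uniformly.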