Paper Title


TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

Authors

Xuyang Bai, Zeyu Hu, Xinge Zhu, Qingqiu Huang, Yilun Chen, Hongbo Fu, Chiew-Lan Tai

Abstract


LiDAR and camera are two important sensors for 3D object detection in autonomous driving. Despite the increasing popularity of sensor fusion in this field, the robustness against inferior image conditions, e.g., bad illumination and sensor misalignment, is under-explored. Existing fusion methods are easily affected by such conditions, mainly due to a hard association of LiDAR points and image pixels, established by calibration matrices. We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions. Specifically, our TransFusion consists of convolutional backbones and a detection head based on a transformer decoder. The first layer of the decoder predicts initial bounding boxes from a LiDAR point cloud using a sparse set of object queries, and its second decoder layer adaptively fuses the object queries with useful image features, leveraging both spatial and contextual relationships. The attention mechanism of the transformer enables our model to adaptively determine where and what information should be taken from the image, leading to a robust and effective fusion strategy. We additionally design an image-guided query initialization strategy to deal with objects that are difficult to detect in point clouds. TransFusion achieves state-of-the-art performance on large-scale datasets. We provide extensive experiments to demonstrate its robustness against degenerated image quality and calibration errors. We also extend the proposed method to the 3D tracking task and achieve the 1st place in the leaderboard of nuScenes tracking, showing its effectiveness and generalization capability.
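The "soft association" the abstract contrasts with calibration-based hard association can be illustrated with plain cross-attention: instead of pairing each object query with the single image pixel given by the projection matrix, the query attends to all image features and takes a similarity-weighted blend. The sketch below is a minimal pure-Python illustration of that idea, not the paper's implementation; the function names and toy dimensions are hypothetical.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of scores.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def soft_associate(query, image_feats):
    """Single-head cross-attention between one object query and a set of
    image features (each a vector of the same dimension as the query).

    Returns the fused feature and the attention weights. A "hard"
    association would instead pick exactly one image feature via the
    calibration matrix; here every feature contributes, weighted by
    similarity, so miscalibration or bad pixels degrade gracefully.
    """
    d = len(query)
    # Scaled dot-product similarity between the query and each image feature.
    scores = [sum(q * k for q, k in zip(query, feat)) / math.sqrt(d)
              for feat in image_feats]
    weights = softmax(scores)
    # Fused feature: convex combination of image features.
    fused = [sum(w * feat[i] for w, feat in zip(weights, image_feats))
             for i in range(d)]
    return fused, weights
```

Because the weights form a convex combination, the fused feature always lies within the span of the image features, and a query can smoothly down-weight uninformative pixels (e.g., under bad illumination) rather than being locked to a single projected location.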
