M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

Paper Title

M$^2$BEV: Multi-Camera Joint 3D Detection and Segmentation with Unified Birds-Eye View Representation

Authors

Enze Xie, Zhiding Yu, Daquan Zhou, Jonah Philion, Anima Anandkumar, Sanja Fidler, Ping Luo, Jose M. Alvarez

Abstract

In this paper, we propose M$^2$BEV, a unified framework that jointly performs 3D object detection and map segmentation in the Bird's-Eye View (BEV) space with multi-camera image inputs. Unlike the majority of previous works, which process detection and segmentation separately, M$^2$BEV infers both tasks with a unified model and improves efficiency. M$^2$BEV efficiently transforms multi-view 2D image features into a 3D BEV feature in ego-car coordinates. Such a BEV representation is important because it enables different tasks to share a single encoder. Our framework further contains four important designs that benefit both accuracy and efficiency: (1) an efficient BEV encoder design that reduces the spatial dimension of a voxel feature map; (2) a dynamic box assignment strategy that uses learning-to-match to assign ground-truth 3D boxes to anchors; (3) BEV centerness re-weighting, which assigns larger weights to more distant predictions; and (4) large-scale 2D detection pre-training and auxiliary supervision. We show that these designs significantly benefit the ill-posed camera-based 3D perception tasks, where depth information is missing. M$^2$BEV is memory efficient, allowing significantly higher-resolution images as input, with faster inference speed. Experiments on nuScenes show that M$^2$BEV achieves state-of-the-art results in both 3D object detection and BEV segmentation, with the best single model achieving 42.5 mAP and 57.0 mIoU in these two tasks, respectively.
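To make design (3) concrete, the following is a minimal sketch of distance-based re-weighting in BEV space. The abstract does not give the formula, so everything here is an assumption for illustration: the function names `bev_centerness_weights` and `reweighted_loss` are hypothetical, the ego car is assumed to sit at the BEV origin, and the weight is assumed to grow linearly from 1 at the ego position to 2 at the farthest point. The paper's actual normalization may differ.

```python
import math


def bev_centerness_weights(points, center=(0.0, 0.0)):
    """Return one weight in [1, 2] per BEV point; farther points weigh more.

    Assumed form (not taken from the abstract): 1 + d / d_max, where d is the
    Euclidean distance from the ego position `center` in the BEV plane.
    """
    cx, cy = center
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    max_dist = max(dists) if dists else 0.0
    if max_dist == 0.0:
        # All points coincide with the ego position: no re-weighting.
        return [1.0 for _ in dists]
    return [1.0 + d / max_dist for d in dists]


def reweighted_loss(per_point_losses, weights):
    """Average the per-point losses after scaling each by its weight."""
    total = sum(loss * w for loss, w in zip(per_point_losses, weights))
    return total / len(per_point_losses)


if __name__ == "__main__":
    pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]  # BEV coordinates in meters
    w = bev_centerness_weights(pts)
    print(w)
    print(reweighted_loss([1.0, 1.0, 1.0], w))
```

The intent of such a scheme is that errors on distant regions, which are harder for a camera-only model to localize, contribute more to the training loss than errors near the ego car.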
