Paper Title
ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation
Paper Authors
Paper Abstract
Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird's-Eye-View (BEV) map, providing a panoptic representation, is a commonly used approach that offers a simplified 2D representation of the vehicle surroundings with accurate semantic-level segmentation for many downstream tasks. Current state-of-the-art approaches to generating BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture to generate BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image. The resulting representation is then provided as input to a spatial transformer decoder module, which outputs segmentation maps in the BEV grid. We evaluate our approach on the nuScenes dataset, demonstrating a considerable improvement in performance relative to state-of-the-art approaches.
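To make the data flow described in the abstract concrete, the following is a minimal PyTorch sketch: a plain ViT backbone encodes the image into patch tokens, the tokens are reshaped into a 2D feature map, and a simplified spatial-transformer-style decoder lifts each image column into a ray of BEV cells, where a 1x1 convolution predicts per-cell class logits. All names (`ViTBackbone`, `ColumnToBEV`, `ViTBEVSegSketch`), dimensions, the column-to-ray mapping, and the class count are illustrative assumptions rather than the authors' implementation; for brevity the sketch is single-scale and omits the paper's hierarchical multi-scale feature extraction.

```python
# Hypothetical sketch of a ViT backbone + spatial-transformer-style BEV decoder.
# Module names, dimensions, and the column-to-ray mapping are illustrative
# assumptions, not the published ViT-BEVSeg implementation.
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    """Minimal ViT: patchify, add positional embeddings, run transformer blocks."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = img_size // patch                      # patch-grid side length
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)           # (B, N, dim)
        B, N, D = tokens.shape
        # Fold the patch tokens back into a 2D feature map.
        return tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)

class ColumnToBEV(nn.Module):
    """Toy spatial transformer: an MLP maps each image column to a ray of
    cells along the depth axis of the BEV grid (a simplified assumption)."""
    def __init__(self, dim=256, feat_h=14, bev_depth=32):
        super().__init__()
        self.bev_depth = bev_depth
        self.mlp = nn.Sequential(
            nn.Linear(dim * feat_h, dim * bev_depth), nn.ReLU())

    def forward(self, fmap):                               # fmap: (B, D, Hf, Wf)
        B, D, Hf, Wf = fmap.shape
        cols = fmap.permute(0, 3, 1, 2).reshape(B, Wf, D * Hf)  # one vector per column
        rays = self.mlp(cols).reshape(B, Wf, D, self.bev_depth)
        return rays.permute(0, 2, 3, 1)                    # (B, D, depth, Wf) BEV grid

class ViTBEVSegSketch(nn.Module):
    def __init__(self, num_classes=14):                    # class count is hypothetical
        super().__init__()
        self.backbone = ViTBackbone()
        self.to_bev = ColumnToBEV(feat_h=224 // 16)
        self.head = nn.Conv2d(256, num_classes, kernel_size=1)  # per-cell logits

    def forward(self, img):
        return self.head(self.to_bev(self.backbone(img)))

if __name__ == "__main__":
    model = ViTBEVSegSketch()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)   # torch.Size([1, 14, 32, 14])
```

Running the sketch on a random 224x224 image yields a (1, 14, 32, 14) logit tensor, i.e. 14 hypothetical classes over a 32x14 BEV grid; the published model instead decodes multi-scale ViT features and is trained and evaluated on nuScenes.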