Paper Title
ViT-BEVSeg: A Hierarchical Transformer Network for Monocular Birds-Eye-View Segmentation
Paper Authors
Paper Abstract
Generating a detailed near-field perceptual model of the environment is an important and challenging problem in both self-driving vehicles and autonomous mobile robotics. A Bird's-Eye-View (BEV) map, providing a panoptic representation, is a commonly used approach that offers a simplified 2D representation of the vehicle surroundings with accurate semantic-level segmentation for many downstream tasks. Current state-of-the-art approaches to generating BEV maps employ a Convolutional Neural Network (CNN) backbone to create feature maps, which are passed through a spatial transformer to project the derived features onto the BEV coordinate frame. In this paper, we evaluate the use of vision transformers (ViT) as a backbone architecture to generate BEV maps. Our network architecture, ViT-BEVSeg, employs standard vision transformers to generate a multi-scale representation of the input image. The resulting representation is then provided as input to a spatial transformer decoder module, which outputs segmentation maps in the BEV grid. We evaluate our approach on the nuScenes dataset, demonstrating a considerable improvement in performance relative to state-of-the-art approaches.
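To make the data flow described in the abstract concrete, the following is a minimal PyTorch sketch: a plain ViT backbone encodes the image into patch tokens, the tokens are reshaped into a 2D feature map, and a simplified spatial-transformer-style decoder lifts each image column into a ray of BEV cells, where a 1x1 convolution predicts per-cell class logits. All names (`ViTBackbone`, `ColumnToBEV`, `ViTBEVSegSketch`), dimensions, the column-to-ray mapping, and the class count are illustrative assumptions rather than the authors' implementation; for brevity the sketch is single-scale and omits the paper's hierarchical multi-scale feature extraction.

```python
# Hypothetical sketch of a ViT backbone + spatial-transformer-style BEV decoder.
# Module names, dimensions, and the column-to-ray mapping are illustrative
# assumptions, not the published ViT-BEVSeg implementation.
import torch
import torch.nn as nn

class ViTBackbone(nn.Module):
    """Minimal ViT: patchify, add positional embeddings, run transformer blocks."""
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=8):
        super().__init__()
        self.grid = img_size // patch                      # patch-grid side length
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.pos = nn.Parameter(torch.zeros(1, self.grid * self.grid, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                                  # x: (B, 3, H, W)
        tokens = self.proj(x).flatten(2).transpose(1, 2)   # (B, N, dim)
        tokens = self.encoder(tokens + self.pos)           # (B, N, dim)
        B, N, D = tokens.shape
        # Fold the patch tokens back into a 2D feature map.
        return tokens.transpose(1, 2).reshape(B, D, self.grid, self.grid)

class ColumnToBEV(nn.Module):
    """Toy spatial transformer: an MLP maps each image column to a ray of
    cells along the depth axis of the BEV grid (a simplified assumption)."""
    def __init__(self, dim=256, feat_h=14, bev_depth=32):
        super().__init__()
        self.bev_depth = bev_depth
        self.mlp = nn.Sequential(
            nn.Linear(dim * feat_h, dim * bev_depth), nn.ReLU())

    def forward(self, fmap):                               # fmap: (B, D, Hf, Wf)
        B, D, Hf, Wf = fmap.shape
        cols = fmap.permute(0, 3, 1, 2).reshape(B, Wf, D * Hf)  # one vector per column
        rays = self.mlp(cols).reshape(B, Wf, D, self.bev_depth)
        return rays.permute(0, 2, 3, 1)                    # (B, D, depth, Wf) BEV grid

class ViTBEVSegSketch(nn.Module):
    def __init__(self, num_classes=14):                    # class count is hypothetical
        super().__init__()
        self.backbone = ViTBackbone()
        self.to_bev = ColumnToBEV(feat_h=224 // 16)
        self.head = nn.Conv2d(256, num_classes, kernel_size=1)  # per-cell logits

    def forward(self, img):
        return self.head(self.to_bev(self.backbone(img)))

if __name__ == "__main__":
    model = ViTBEVSegSketch()
    out = model(torch.randn(1, 3, 224, 224))
    print(out.shape)   # torch.Size([1, 14, 32, 14])
```

Running the sketch on a random 224x224 image yields a (1, 14, 32, 14) logit tensor, i.e. 14 hypothetical classes over a 32x14 BEV grid; the published model instead decodes multi-scale ViT features and is trained and evaluated on nuScenes.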