Paper Title
Time-Space Transformers for Video Panoptic Segmentation
Paper Authors
Paper Abstract
We propose a novel solution for the task of video panoptic segmentation that simultaneously predicts pixel-level semantic and instance segmentation and generates clip-level instance tracks. Our network, named VPS-Transformer, has a hybrid architecture based on the state-of-the-art panoptic segmentation network Panoptic-DeepLab: it combines a convolutional architecture for single-frame panoptic segmentation with a novel video module based on an instantiation of the pure Transformer block. The Transformer, equipped with attention mechanisms, models spatio-temporal relations between backbone output features of the current and past frames, yielding more accurate and consistent panoptic estimates. Since the pure Transformer block introduces a large computational overhead when processing high-resolution images, we propose a few design changes for more efficient computation. We study how to aggregate information more effectively over the space-time volume and compare several variants of the Transformer block with different attention schemes. Extensive experiments on the Cityscapes-VPS dataset demonstrate that our best model improves temporal consistency and video panoptic quality by a margin of 2.2%, with little extra computation.
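The abstract describes a video module whose attention relates current-frame backbone features to a space-time volume built from current and past frames, with design changes to keep attention tractable at high resolution. The sketch below is a minimal PyTorch illustration of that general idea only; the class name, the pooling-based key/value reduction, and all shapes and hyperparameters are our assumptions for illustration, not the paper's actual design.

```python
import torch
import torch.nn as nn

class SpaceTimeAttentionBlock(nn.Module):
    """Illustrative sketch (not the paper's implementation): a Transformer
    block whose queries come from current-frame backbone features and whose
    keys/values span current and past frames. Keys/values are spatially
    pooled, one plausible way to cut the quadratic attention cost."""

    def __init__(self, dim, num_heads=4, kv_downsample=4):
        super().__init__()
        self.pool = nn.AvgPool2d(kv_downsample)  # assumed efficiency measure
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.norm_mlp = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, curr, past):
        # curr: (B, C, H, W) current-frame backbone features
        # past: (B, T, C, H, W) backbone features of T past frames
        B, C, H, W = curr.shape
        q = curr.flatten(2).transpose(1, 2)                # (B, H*W, C) queries
        frames = torch.cat([curr.unsqueeze(1), past], 1)   # (B, T+1, C, H, W)
        kv = self.pool(frames.flatten(0, 1))               # pool each frame spatially
        kv = kv.flatten(2).transpose(1, 2).reshape(B, -1, C)  # space-time tokens
        # Cross-attention from current-frame pixels to the space-time volume.
        x = q + self.attn(self.norm_q(q), self.norm_kv(kv), self.norm_kv(kv))[0]
        x = x + self.mlp(self.norm_mlp(x))
        return x.transpose(1, 2).reshape(B, C, H, W)       # back to (B, C, H, W)
```

Under these assumptions, a call such as `block(curr, past)` returns temporally aggregated features at the input spatial resolution, and pooling keys/values by a factor d shrinks the attention cost per frame by roughly d², consistent with the abstract's goal of little extra computation.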