Paper Title

TubeFormer-DeepLab: Video Mask Transformer

Paper Authors

Dahun Kim, Jun Xie, Huiyu Wang, Siyuan Qiao, Qihang Yu, Hong-Seok Kim, Hartwig Adam, In So Kweon, Liang-Chieh Chen

Paper Abstract

We present TubeFormer-DeepLab, the first attempt to tackle multiple core video segmentation tasks in a unified manner. Different video segmentation tasks (e.g., video semantic/instance/panoptic segmentation) are usually considered as distinct problems. State-of-the-art models adopted in the separate communities have diverged, and radically different approaches dominate in each task. By contrast, we make a crucial observation that video segmentation tasks could be generally formulated as the problem of assigning different predicted labels to video tubes (where a tube is obtained by linking segmentation masks along the time axis) and the labels may encode different values depending on the target task. The observation motivates us to develop TubeFormer-DeepLab, a simple and effective video mask transformer model that is widely applicable to multiple video segmentation tasks. TubeFormer-DeepLab directly predicts video tubes with task-specific labels (either pure semantic categories, or both semantic categories and instance identities), which not only significantly simplifies video segmentation models, but also advances state-of-the-art results on multiple video segmentation benchmarks.
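The abstract's unifying idea — a "video tube" as per-frame segmentation masks linked along the time axis, paired with a task-specific label — can be sketched minimally as below. This is not the authors' implementation; `make_tube` and its parameters are illustrative names, assuming masks are boolean arrays and the label is a (semantic class, instance id) pair where the instance id is simply unused for pure semantic segmentation.

```python
import numpy as np

def make_tube(frame_masks, semantic_class, instance_id=None):
    """Link per-frame binary masks into one video tube (a sketch).

    frame_masks: list of (H, W) boolean arrays, one per frame.
    semantic_class: category label, used by every task variant.
    instance_id: identity label; None for pure semantic segmentation.
    """
    tube = np.stack(frame_masks, axis=0)  # (T, H, W) mask volume over time
    label = (semantic_class, instance_id)
    return tube, label

# Two frames of a 4x4 video: the same object's mask, shifted over time.
m0 = np.zeros((4, 4), dtype=bool); m0[1:3, 1:3] = True
m1 = np.zeros((4, 4), dtype=bool); m1[1:3, 2:4] = True

# Video panoptic/instance setting: label encodes class AND identity.
tube, label = make_tube([m0, m1], semantic_class=7, instance_id=0)
print(tube.shape)  # (2, 4, 4)
print(label)       # (7, 0)

# Video semantic setting: same formulation, identity left unset.
_, sem_label = make_tube([m0, m1], semantic_class=7)
print(sem_label)   # (7, None)
```

The point of the sketch is that all three tasks share one output structure — a mask volume plus a label — and only the label's contents change with the target task.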
