Paper Title
Breaking the "Object" in Video Object Segmentation
Authors
Abstract
The appearance of an object can be fleeting when it transforms. As eggs are broken or paper is torn, their color, shape and texture can change dramatically, preserving virtually nothing of the original except for the identity itself. Yet, this important phenomenon is largely absent from existing video object segmentation (VOS) benchmarks. In this work, we close the gap by collecting a new dataset for Video Object Segmentation under Transformations (VOST). It consists of more than 700 high-resolution videos, captured in diverse environments, which are 21 seconds long on average and densely labeled with instance masks. A careful, multi-step approach is adopted to ensure that these videos focus on complex object transformations, capturing their full temporal extent. We then extensively evaluate state-of-the-art VOS methods and make a number of important discoveries. In particular, we show that existing methods struggle when applied to this novel task and that their main limitation lies in over-reliance on static appearance cues. This motivates us to propose a few modifications for the top-performing baseline that improve its capabilities by better modeling spatio-temporal information. But more broadly, the hope is to stimulate discussion on learning more robust video object representations.
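The abstract mentions that state-of-the-art VOS methods are evaluated on the densely labeled instance masks. VOS benchmarks are commonly scored with region similarity (the Jaccard index, mask IoU between predicted and ground-truth masks); the sketch below shows this standard metric in a minimal form and is an illustration, not the paper's exact evaluation protocol:

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Region similarity (Jaccard index) between two binary masks.

    Both inputs are 2-D arrays where nonzero pixels belong to the object.
    """
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        # Both masks empty: conventionally treated as a perfect match.
        return 1.0
    return float(np.logical_and(pred, gt).sum() / union)
```

A benchmark score would then average this quantity over all annotated frames and object instances; how empty frames and multiple instances are aggregated is a design choice that varies between benchmarks.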