Paper Title
MinVIS: A Minimal Video Instance Segmentation Framework without Video-based Training
Paper Authors
Paper Abstract
We propose MinVIS, a minimal video instance segmentation (VIS) framework that achieves state-of-the-art VIS performance with neither video-based architectures nor training procedures. By only training a query-based image instance segmentation model, MinVIS outperforms the previous best result on the challenging Occluded VIS dataset by over 10% AP. Since MinVIS treats frames in training videos as independent images, we can drastically sub-sample the annotated frames in training videos without any modifications. With only 1% of labeled frames, MinVIS outperforms or is comparable to fully-supervised state-of-the-art approaches on YouTube-VIS 2019/2021. Our key observation is that queries trained to be discriminative between intra-frame object instances are temporally consistent and can be used to track instances without any manually designed heuristics. MinVIS thus has the following inference pipeline: we first apply the trained query-based image instance segmentation to video frames independently. The segmented instances are then tracked by bipartite matching of the corresponding queries. This inference is done in an online fashion and does not need to process the whole video at once. MinVIS thus has the practical advantages of reducing both the labeling costs and the memory requirements, while not sacrificing the VIS performance. Code is available at: https://github.com/NVlabs/MinVIS
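To make the inference pipeline concrete, below is a minimal sketch (not the official MinVIS implementation) of the tracking step: per-frame query embeddings from a query-based image instance segmentation model are associated across consecutive frames by bipartite matching, here with cosine similarity and the Hungarian algorithm. The function and variable names are illustrative assumptions.

```python
# Minimal sketch of online instance tracking via bipartite matching of
# per-frame query embeddings. Not the official MinVIS code; names and the
# cosine-similarity cost are assumptions for illustration.

import numpy as np
from scipy.optimize import linear_sum_assignment


def match_queries(prev_queries: np.ndarray, curr_queries: np.ndarray) -> np.ndarray:
    """For each current query, return the index of the matched previous query.

    prev_queries, curr_queries: arrays of shape (num_queries, embed_dim),
    e.g. the per-frame object queries of a query-based segmentation model.
    """
    prev_norm = prev_queries / np.linalg.norm(prev_queries, axis=1, keepdims=True)
    curr_norm = curr_queries / np.linalg.norm(curr_queries, axis=1, keepdims=True)
    cost = -curr_norm @ prev_norm.T  # negative cosine similarity: lower is better
    row_ind, col_ind = linear_sum_assignment(cost)  # bipartite (Hungarian) matching
    match = np.empty(len(curr_queries), dtype=int)
    match[row_ind] = col_ind
    return match


def track_video(per_frame_queries):
    """Online tracking: process one frame at a time and propagate track ids.

    per_frame_queries: iterable of (num_queries, embed_dim) arrays, one per frame.
    Yields an array of track ids per frame, aligned with that frame's queries.
    """
    track_ids, prev = None, None
    for queries in per_frame_queries:
        if prev is None:
            track_ids = np.arange(len(queries))  # first frame defines the tracks
        else:
            match = match_queries(prev, queries)
            track_ids = track_ids[match]  # carry forward ids of matched queries
        prev = queries
        yield track_ids
```

Because each frame is matched only to the previous one, the whole video never needs to be held in memory at once, which reflects the online, low-memory property described in the abstract.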