Paper Title
Solve the Puzzle of Instance Segmentation in Videos: A Weakly Supervised Framework with Spatio-Temporal Collaboration
Paper Authors
Abstract
Instance segmentation in videos, which aims to segment and track multiple objects across video frames, has garnered a flurry of research attention in recent years. In this paper, we present a novel weakly supervised framework with \textbf{S}patio-\textbf{T}emporal \textbf{C}ollaboration for instance \textbf{Seg}mentation in videos, namely \textbf{STC-Seg}. Concretely, STC-Seg makes four contributions. First, we leverage the complementary representations from unsupervised depth estimation and optical flow to produce effective pseudo-labels for training deep networks and predicting high-quality instance masks. Second, to enhance mask generation, we devise a puzzle loss, which enables end-to-end training using only box-level annotations. Third, our tracking module jointly utilizes bounding-box diagonal points with spatio-temporal discrepancy to model movements, which largely improves robustness to varying object appearances. Finally, our framework is flexible and enables image-level instance segmentation methods to operate on the video-level task. We conduct an extensive set of experiments on the KITTI MOTS and YT-VIS datasets. Experimental results demonstrate that our method achieves strong performance and even outperforms the fully supervised TrackR-CNN and MaskTrack R-CNN. We believe that STC-Seg can be a valuable addition to the community, as it reveals only the tip of the iceberg of innovative opportunities in the weakly supervised paradigm for instance segmentation in videos.
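The tracking contribution above represents each bounding box by its diagonal corner points and links boxes across frames by how little those points move. The abstract does not give the exact formulation, so the following is only a minimal sketch of that matching idea under simplifying assumptions: it uses mean corner displacement as the association cost and a greedy smallest-cost-first assignment, omitting the spatio-temporal discrepancy term of the actual STC-Seg module. All function names here are hypothetical illustrations, not the paper's API.

```python
import math

def diagonal_points(box):
    # Represent a box (x1, y1, x2, y2) by its two diagonal corners,
    # as in the diagonal-point representation described in the abstract.
    x1, y1, x2, y2 = box
    return [(x1, y1), (x2, y2)]

def box_distance(a, b):
    # Mean L2 distance between corresponding diagonal corners of two
    # boxes: a simple proxy for how far the object moved between frames.
    corners_a, corners_b = diagonal_points(a), diagonal_points(b)
    return sum(math.dist(p, q) for p, q in zip(corners_a, corners_b)) / 2.0

def greedy_match(prev_boxes, curr_boxes):
    # Greedily link each previous-frame box to the nearest unclaimed
    # current-frame box, processing candidate pairs smallest cost first.
    # (A real tracker would likely use Hungarian matching and extra cues.)
    pairs = sorted(
        (box_distance(p, c), i, j)
        for i, p in enumerate(prev_boxes)
        for j, c in enumerate(curr_boxes)
    )
    matches, used_i, used_j = {}, set(), set()
    for _, i, j in pairs:
        if i not in used_i and j not in used_j:
            matches[i] = j
            used_i.add(i)
            used_j.add(j)
    return matches
```

For example, with two objects that swap list order between frames, the small per-corner displacement still recovers the correct identity links: `greedy_match([(0, 0, 10, 10), (50, 50, 60, 60)], [(52, 51, 62, 61), (1, 0, 11, 10)])` maps track 0 to detection 1 and track 1 to detection 0.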