Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Paper Title

Weakly-Supervised Semantic Segmentation with Visual Words Learning and Hybrid Pooling

Authors

Ru, Lixiang, Du, Bo, Zhan, Yibing, Wu, Chen

Abstract

Weakly-Supervised Semantic Segmentation (WSSS) methods with image-level labels generally train a classification network to generate Class Activation Maps (CAMs) as the initial coarse segmentation labels. However, current WSSS methods still perform far from satisfactorily because the CAMs they adopt 1) typically focus on partial discriminative object regions and 2) usually contain useless background regions. These two problems are attributed to the sole image-level supervision and the aggregation of global information when training the classification networks. In this work, we propose the visual words learning module and the hybrid pooling approach, and incorporate them into the classification network to mitigate the above problems. In the visual words learning module, we counter the first problem by forcing the classification network to learn fine-grained visual word labels so that more object extents can be discovered. Specifically, the visual words are learned with a codebook, which can be updated via two proposed strategies, i.e., a learning-based strategy and a memory-bank strategy. The second drawback of CAMs is alleviated with the proposed hybrid pooling, which incorporates global average and local discriminative information to simultaneously ensure object completeness and reduce background regions. We evaluated our methods on the PASCAL VOC 2012 and MS COCO 2014 datasets. Without any extra saliency prior, our method achieved 70.6% and 70.7% mIoU on the $val$ and $test$ sets of the PASCAL VOC dataset, respectively, and 36.2% mIoU on the $val$ set of the MS COCO dataset, which significantly surpassed the performance of state-of-the-art WSSS methods.
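
The abstract names two mechanisms without giving formulas: visual words learned with a codebook, and a pooling that mixes global-average with local discriminative information. The sketch below is only an illustration of those two ideas under stated assumptions, not the authors' implementation. It assumes that each pixel feature is assigned to its most similar codebook entry by cosine similarity, that the image-level visual word label is the multi-hot set of words that occur, and that hybrid pooling can be approximated by a weighted blend of global average and global max pooling. The function names and the `weight` parameter are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def assign_visual_words(features, codebook):
    """Map each pixel feature to the index of its most similar codebook entry.

    Assumption: nearest-entry assignment by cosine similarity.
    """
    return [max(range(len(codebook)), key=lambda k: cosine(f, codebook[k]))
            for f in features]

def image_word_labels(assignments, num_words):
    """Multi-hot image-level label: 1 if the visual word occurs anywhere."""
    present = set(assignments)
    return [1 if k in present else 0 for k in range(num_words)]

def hybrid_pool(activation_map, weight=0.5):
    """Blend global average pooling with global max pooling.

    Assumption: a convex combination stands in for the paper's hybrid pooling;
    the average term favors object completeness, the max term favors the most
    discriminative location.
    """
    values = [v for row in activation_map for v in row]
    avg = sum(values) / len(values)
    return weight * avg + (1.0 - weight) * max(values)

# Toy data: three codebook entries and three pixel features in 2-D.
codebook = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
features = [[0.9, 0.1], [0.2, 0.8], [1.0, 0.0]]

words = assign_visual_words(features, codebook)   # per-pixel word indices
labels = image_word_labels(words, len(codebook))  # image-level word labels
score = hybrid_pool([[0.0, 1.0], [0.0, 3.0]])     # pooled class score
```

In this toy setup the image-level word labels would serve as an extra, finer-grained classification target alongside the ordinary class labels. The codebook update strategies mentioned in the abstract (learning-based and memory-bank) are not sketched here because the abstract does not describe them.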
