Paper Title
Referring Image Segmentation via Cross-Modal Progressive Comprehension
Paper Authors
Paper Abstract
Referring image segmentation aims at segmenting the foreground masks of the entities that best match the description given in a natural language expression. Previous approaches tackle this problem using implicit feature interaction and fusion between the visual and linguistic modalities, but usually fail to exploit the informative words of the expression to align features from the two modalities well for accurately identifying the referred entity. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) module and a Text-Guided Feature Exchange (TGFE) module to effectively address this challenging task. Concretely, the CMPC module first employs entity and attribute words to perceive all the related entities that might be considered by the expression. Then, relational words are adopted to highlight the correct entity and suppress other irrelevant ones via multimodal graph reasoning. In addition to the CMPC module, we further leverage a simple yet effective TGFE module to integrate the reasoned multimodal features from different levels under the guidance of textual information. In this way, features from multiple levels can communicate with each other and be refined based on the textual context. We conduct extensive experiments on four popular referring segmentation benchmarks and achieve new state-of-the-art performance.
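The abstract outlines a two-stage pipeline: candidate entities are first perceived by gating visual features with entity/attribute words, relational words then drive graph reasoning over regions to isolate the referred entity, and finally multi-level features are exchanged under textual guidance. Below is a minimal PyTorch sketch of that idea, not the paper's actual implementation; the module names, tensor shapes, gating, and adjacency formulations here are all illustrative assumptions.

```python
# Hypothetical sketch of CMPC-style progressive comprehension and
# TGFE-style text-guided exchange; details differ from the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMPC(nn.Module):
    """Stage 1: entity/attribute words gate regions to perceive candidates.
    Stage 2: relational words condition one round of graph reasoning that
    passes messages between regions, suppressing irrelevant candidates."""
    def __init__(self, dim):
        super().__init__()
        self.vis_proj = nn.Linear(dim, dim)
        self.lang_proj = nn.Linear(dim, dim)
        self.rel_proj = nn.Linear(dim, dim)
        self.gcn = nn.Linear(dim, dim)  # one graph-convolution step

    def forward(self, vis, entity_attr, relation):
        # vis: (B, N, D) region features; entity_attr, relation: (B, D)
        # pooled embeddings of entity/attribute and relational words.
        gate = torch.sigmoid(self.vis_proj(vis) *
                             self.lang_proj(entity_attr).unsqueeze(1))
        cand = vis * gate                              # candidate perception
        rel = self.rel_proj(relation).unsqueeze(1)     # (B, 1, D)
        # Relation-conditioned adjacency between regions, then message passing.
        adj = torch.softmax((cand * rel) @ cand.transpose(1, 2)
                            / cand.size(-1) ** 0.5, dim=-1)
        return cand + F.relu(self.gcn(adj @ cand))     # (B, N, D)

class TGFE(nn.Module):
    """Multi-level features (assumed to share shape) exchange information,
    with the sentence embedding gating how much of the fused signal each
    level absorbs."""
    def __init__(self, dim):
        super().__init__()
        self.text_gate = nn.Linear(dim, dim)

    def forward(self, feats, sentence):
        # feats: list of (B, N, D) tensors from different levels; sentence: (B, D)
        g = torch.sigmoid(self.text_gate(sentence)).unsqueeze(1)  # (B, 1, D)
        fused = torch.stack(feats, dim=0).mean(dim=0)             # cross-level mix
        return [f + g * fused for f in feats]                     # text-gated refine

# Hypothetical usage with 256-d features and 100 regions per level.
B, N, D = 2, 100, 256
vis = torch.randn(B, N, D)
ea, rel, sent = torch.randn(B, D), torch.randn(B, D), torch.randn(B, D)
reasoned = CMPC(D)(vis, ea, rel)                        # (B, N, D)
refined = TGFE(D)([reasoned, torch.randn(B, N, D)], sent)
```

The ordering is the point the abstract stresses: candidates are perceived from entity/attribute words before relational words prune them, and only then are multi-level features exchanged under textual guidance.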