Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

Paper Title

Consensus-Aware Visual-Semantic Embedding for Image-Text Matching

Authors

Haoran Wang, Ying Zhang, Zhong Ji, Yanwei Pang, Lin Ma

Abstract

Image-text matching plays a central role in bridging vision and language. Most existing approaches rely only on image-text instance pairs to learn their representations, exploiting the pairs' matching relationships and making the corresponding alignments. Such approaches exploit only the superficial associations contained in the pairwise instance data and do not consider any external commonsense knowledge, which may hinder their ability to reason about higher-level relationships between image and text. In this paper, we propose a Consensus-aware Visual-Semantic Embedding (CVSE) model that incorporates consensus information, namely the commonsense knowledge shared between both modalities, into image-text matching. Specifically, the consensus information is exploited by computing the statistical co-occurrence correlations between semantic concepts from the image captioning corpus and deploying the constructed concept correlation graph to yield consensus-aware concept (CAC) representations. CVSE then learns the associations and alignments between image and text based on the exploited consensus as well as the instance-level representations of both modalities. Extensive experiments on two public datasets verify that the exploited consensus contributes significantly to constructing more meaningful visual-semantic embeddings, yielding superior performance over state-of-the-art approaches on the bidirectional image and text retrieval task. Our code is available at: https://github.com/BruceW91/CVSE.
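The co-occurrence statistics mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function name `concept_cooccurrence`, the whitespace tokenization, the toy captions, and the choice of conditional probability as the edge weight are all assumptions made for illustration, since the abstract does not specify how the correlations are normalized.

```python
from collections import Counter
from itertools import combinations


def concept_cooccurrence(captions, vocabulary):
    """Return {(a, b): P(b | a)} for concepts a, b that share a caption.

    Hypothetical helper: P(b | a) is estimated as the number of captions
    containing both a and b divided by the number containing a.
    """
    vocab = set(vocabulary)
    concept_counts = Counter()
    pair_counts = Counter()
    for caption in captions:
        # Naive whitespace tokenization; each concept counts once per caption.
        present = sorted(vocab & set(caption.lower().split()))
        concept_counts.update(present)
        for a, b in combinations(present, 2):
            pair_counts[(a, b)] += 1
            pair_counts[(b, a)] += 1
    return {
        (a, b): n / concept_counts[a]
        for (a, b), n in pair_counts.items()
    }


# Toy corpus (made up for this example, not taken from any dataset).
captions = [
    "a man rides a surfboard on a wave",
    "a man holds a surfboard",
    "a dog runs on the beach",
]
graph = concept_cooccurrence(captions, ["man", "surfboard", "wave", "dog"])
print(graph[("man", "surfboard")])   # 1.0
print(graph[("surfboard", "wave")])  # 0.5
```

The resulting dictionary is a directed, weighted edge list over concepts. Note that the weights are asymmetric: "wave" always appears with "surfboard" in this toy corpus, but "surfboard" appears with "wave" only half the time.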
