Paper Title
Multi-Modal Retrieval using Graph Neural Networks
Paper Authors
Paper Abstract
Most real-world applications of image retrieval, such as Adobe Stock, a marketplace for stock photography and illustrations, need a way for users to find images that are both visually (i.e. aesthetically) and conceptually (i.e. containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well-studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual-embedding-based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures rich information through node neighborhoods. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference-time control, based on selective neighborhood connectivity, that gives the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset, and qualitatively on MS-COCO and an Adobe Stock dataset.
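To make the abstract's idea concrete, the following is a minimal, hypothetical sketch, not the paper's implementation: image nodes and concept-tag nodes are placed in one bipartite graph, a GraphSAGE-style mean-aggregation layer produces joint embeddings in a shared space, and an inference-time switch that drops concept edges stands in for the selective-neighborhood control described above. All names here (MeanSageLayer, joint_embeddings, use_concepts) are assumptions made for illustration.

```python
# Illustrative sketch only: joint image/concept embeddings via a single
# GraphSAGE-style layer over an image-concept bipartite graph. The real paper's
# architecture and training objective are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanSageLayer(nn.Module):
    """One layer: concat(self features, mean of neighbor features) -> linear -> ReLU."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) 0/1 adjacency; divide by degree to take the mean over neighbors.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neigh = (adj @ x) / deg
        return F.relu(self.lin(torch.cat([x, neigh], dim=1)))


def joint_embeddings(img_feats, concept_feats, edges, use_concepts=True):
    """Embed image and concept nodes in one space.

    img_feats:     (num_images, d) visual features (e.g. CNN activations)
    concept_feats: (num_concepts, d) concept/tag features (e.g. word vectors)
    edges:         list of (image_index, concept_index) pairs
    use_concepts:  hypothetical inference-time switch -- dropping concept edges
                   leaves purely visual neighborhoods, a toy analogue of the
                   selective neighborhood connectivity in the abstract.
    """
    n_img, n_con = img_feats.size(0), concept_feats.size(0)
    x = torch.cat([img_feats, concept_feats], dim=0)   # joint node feature matrix
    adj = torch.zeros(n_img + n_con, n_img + n_con)
    if use_concepts:
        for i, c in edges:                             # undirected image-concept edges
            adj[i, n_img + c] = adj[n_img + c, i] = 1.0
    layer = MeanSageLayer(x.size(1), 64)               # untrained layer, for illustration
    z = layer(x, adj)
    return F.normalize(z, dim=1)                       # unit norm for cosine retrieval


if __name__ == "__main__":
    torch.manual_seed(0)
    imgs = torch.randn(5, 32)   # 5 toy image feature vectors
    cons = torch.randn(3, 32)   # 3 toy concept feature vectors
    edges = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 2)]
    z = joint_embeddings(imgs, cons, edges)
    # Rank images by cosine similarity to image 0 in the joint space.
    sims = z[:5] @ z[0]
    print(sims.argsort(descending=True))
```

Flipping use_concepts to False removes the concept neighborhoods entirely, so the same code path can return results driven only by the visual features; this is the spirit of the inference-time control the abstract describes, sketched here under the stated assumptions.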