Paper Title
Multi-Modal Retrieval using Graph Neural Networks
Paper Authors
Paper Abstract
Most real-world applications of image retrieval, such as Adobe Stock, a marketplace for stock photography and illustrations, need a way for users to find images that are both visually (i.e. aesthetically) and conceptually (i.e. containing the same salient objects) similar to a query image. Learning visual-semantic representations from images is a well-studied problem for image retrieval. Filtering based on image concepts or attributes is traditionally achieved with index-based filtering (e.g. on textual tags) or by re-ranking after an initial visual-embedding-based retrieval. In this paper, we learn a joint vision and concept embedding in the same high-dimensional space. This joint model gives the user fine-grained control over the semantics of the result set, allowing them to explore the catalog of images more rapidly. We model the visual and concept relationships as a graph structure, which captures rich information through node neighborhoods. This graph structure helps us learn multi-modal node embeddings using Graph Neural Networks. We also introduce a novel inference-time control, based on selective neighborhood connectivity, that gives the user control over the retrieval algorithm. We evaluate these multi-modal embeddings quantitatively on the downstream relevance task of image retrieval on the MS-COCO dataset, and qualitatively on MS-COCO and an Adobe Stock dataset.
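To make the abstract's idea concrete, the following is a minimal, hypothetical sketch, not the paper's implementation: image nodes and concept-tag nodes are placed in one bipartite graph, a GraphSAGE-style mean-aggregation layer produces joint embeddings in a shared space, and an inference-time switch that drops concept edges stands in for the selective-neighborhood control described above. All names here (MeanSageLayer, joint_embeddings, use_concepts) are assumptions made for illustration.

```python
# Illustrative sketch only: joint image/concept embeddings via a single
# GraphSAGE-style layer over an image-concept bipartite graph. The real paper's
# architecture and training objective are not reproduced here.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MeanSageLayer(nn.Module):
    """One layer: concat(self features, mean of neighbor features) -> linear -> ReLU."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(2 * in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # adj: (N, N) 0/1 adjacency; divide by degree to take the mean over neighbors.
        deg = adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        neigh = (adj @ x) / deg
        return F.relu(self.lin(torch.cat([x, neigh], dim=1)))


def joint_embeddings(img_feats, concept_feats, edges, use_concepts=True):
    """Embed image and concept nodes in one space.

    img_feats:     (num_images, d) visual features (e.g. CNN activations)
    concept_feats: (num_concepts, d) concept/tag features (e.g. word vectors)
    edges:         list of (image_index, concept_index) pairs
    use_concepts:  hypothetical inference-time switch -- dropping concept edges
                   leaves purely visual neighborhoods, a toy analogue of the
                   selective neighborhood connectivity in the abstract.
    """
    n_img, n_con = img_feats.size(0), concept_feats.size(0)
    x = torch.cat([img_feats, concept_feats], dim=0)   # joint node feature matrix
    adj = torch.zeros(n_img + n_con, n_img + n_con)
    if use_concepts:
        for i, c in edges:                             # undirected image-concept edges
            adj[i, n_img + c] = adj[n_img + c, i] = 1.0
    layer = MeanSageLayer(x.size(1), 64)               # untrained layer, for illustration
    z = layer(x, adj)
    return F.normalize(z, dim=1)                       # unit norm for cosine retrieval


if __name__ == "__main__":
    torch.manual_seed(0)
    imgs = torch.randn(5, 32)   # 5 toy image feature vectors
    cons = torch.randn(3, 32)   # 3 toy concept feature vectors
    edges = [(0, 0), (1, 0), (2, 1), (3, 2), (4, 2)]
    z = joint_embeddings(imgs, cons, edges)
    # Rank images by cosine similarity to image 0 in the joint space.
    sims = z[:5] @ z[0]
    print(sims.argsort(descending=True))
```

Flipping use_concepts to False removes the concept neighborhoods entirely, so the same code path can return results driven only by the visual features; this is the spirit of the inference-time control the abstract describes, sketched here under the stated assumptions.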