ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Paper Title

ALADIN: Distilling Fine-grained Alignment Scores for Efficient Image-Text Matching and Retrieval

Authors

Nicola Messina, Matteo Stefanini, Marcella Cornia, Lorenzo Baraldi, Fabrizio Falchi, Giuseppe Amato, Rita Cucchiara

Abstract

Image-text matching is gaining a leading role among tasks involving the joint understanding of vision and language. In the literature, this task is often used as a pre-training objective to forge architectures able to jointly deal with images and texts. Nonetheless, it has a direct downstream application: cross-modal retrieval, which consists of finding images related to a given query text or vice versa. Solving this task is of critical importance in cross-modal search engines. Many recent methods have proposed effective solutions to the image-text matching problem, mostly using recent large vision-language (VL) Transformer networks. However, these models are often computationally expensive, especially at inference time. This prevents their adoption in large-scale cross-modal retrieval scenarios, where results should be provided to the user almost instantaneously. In this paper, we aim to bridge the gap between effectiveness and efficiency by proposing an ALign And DIstill Network (ALADIN). ALADIN first produces highly effective scores by aligning images and texts at a fine-grained level. Then, it learns a shared embedding space, where an efficient kNN search can be performed, by distilling the relevance scores obtained from the fine-grained alignments. We obtained remarkable results on MS-COCO, showing that our method can compete with state-of-the-art VL Transformers while being almost 90 times faster. The code for reproducing our results is available at https://github.com/mesnico/ALADIN.
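
The two stages described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not state the form of the distillation objective, so the row-wise KL divergence below is an assumption, and the function names (`distillation_loss`, `knn_search`), the temperature parameter, and the toy embeddings are all hypothetical. The sketch only shows the general idea of matching a cheap similarity to an expensive teacher score, then retrieving by nearest-neighbor search in the shared space.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), eps)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(teacher_scores, student_scores, temperature=1.0):
    """Mean row-wise KL(teacher || student) between two (n_texts, n_images) score matrices.

    Assumption: a listwise KL objective; the abstract does not specify the loss.
    """
    p = softmax(teacher_scores / temperature, axis=1)
    q = softmax(student_scores / temperature, axis=1)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=1)
    return float(np.mean(kl))

def knn_search(query_emb, gallery_emb, k):
    """Return, for each query row, the indices of the k most cosine-similar gallery rows."""
    sims = l2_normalize(query_emb) @ l2_normalize(gallery_emb).T
    return np.argsort(-sims, axis=1)[:, :k]

# Toy data (hypothetical): 4 image embeddings and 2 text queries in a 3-d shared space.
image_emb = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0],
                      [1.0, 1.0, 0.0]])
text_emb = np.array([[0.9, 0.1, 0.0],
                     [0.0, 0.2, 0.8]])

top2 = knn_search(text_emb, image_emb, k=2)
print(top2)
```

Because the gallery embeddings can be computed once and indexed offline, a query needs only one forward pass and a similarity lookup, which is the kind of efficiency gain the abstract refers to. A production system would typically replace the brute-force `argsort` with an approximate nearest-neighbor index.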
