Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Paper Title

Cross-Lingual Cross-Modal Retrieval with Noise-Robust Learning

Authors

Yabing Wang, Jianfeng Dong, Tianxiang Liang, Minsong Zhang, Rui Cai, Xun Wang

Abstract

Despite the recent developments in the field of cross-modal retrieval, there has been less research focusing on low-resource languages due to the lack of manually annotated datasets. In this paper, we propose a noise-robust cross-lingual cross-modal retrieval method for low-resource languages. To this end, we use Machine Translation (MT) to construct pseudo-parallel sentence pairs for low-resource languages. However, as MT is not perfect, it tends to introduce noise during translation, rendering textual embeddings corrupted and thereby compromising the retrieval performance. To alleviate this, we introduce a multi-view self-distillation method to learn noise-robust target-language representations, which employs a cross-attention module to generate soft pseudo-targets to provide direct supervision from the similarity-based view and feature-based view. Besides, inspired by the back-translation in unsupervised MT, we minimize the semantic discrepancies between origin sentences and back-translated sentences to further improve the noise robustness of the textual encoder. Extensive experiments are conducted on three video-text and image-text cross-modal retrieval benchmarks across different languages, and the results demonstrate that our method significantly improves the overall performance without using extra human-labeled data. In addition, equipped with a pre-trained visual encoder from a recent vision-and-language pre-training framework, i.e., CLIP, our model achieves a significant performance gain, showing that our method is compatible with popular pre-training models. Code and data are available at https://github.com/HuiGuanLab/nrccr.
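
The similarity-based view of the self-distillation idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see their repository for that). It assumes the teacher scores come from a clean-text branch such as the cross-attention module, and the student scores come from the noisy machine-translated text. The function names, the temperature value, and the use of a KL divergence are all assumptions made for illustration.

```python
import math

def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of similarity scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def similarity_distillation_loss(teacher_sims, student_sims, temperature=0.1):
    """KL(teacher || student) over text-to-candidate similarity distributions.

    teacher_sims: similarities between one sentence and N candidate videos or
                  images, taken from the (assumed) clean teacher branch.
    student_sims: the same similarities computed from the noisy translated
                  sentence. Both are assumed to be cosine similarities in
                  [-1, 1], so the student probabilities do not underflow to 0.
    """
    p = softmax(teacher_sims, temperature)  # soft pseudo-targets
    q = softmax(student_sims, temperature)  # student distribution
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical scores for one caption against three candidates.
teacher = [0.9, 0.2, 0.1]
student = [0.5, 0.4, 0.3]
loss = similarity_distillation_loss(teacher, student)
```

Minimizing a loss of this kind would pull the noisy target-language similarity distribution toward the cleaner teacher distribution. The feature-based view and the back-translation objective mentioned in the abstract are not covered by this sketch.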
