Paper Title

Large-scale Bilingual Language-Image Contrastive Learning

Authors

Byungsoo Ko, Geonmo Gu

Abstract

This paper is a technical report sharing our experience and findings from building a Korean and English bilingual multimodal model. While most multimodal datasets focus on English and multilingual multimodal research relies on machine-translated texts, such machine-translated texts fall short in describing unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) a multimodal model trained without explicit cross-lingual relations can still learn such relations via visual semantics; 3) our bilingual KELIP can capture cultural differences in the visual semantics of words with the same meaning; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.
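The language-image contrastive objective named in the title is the CLIP-style symmetric InfoNCE loss: in a batch of matched image-text pairs, each image's positive is its own caption and every other caption is a negative, and vice versa. Below is a minimal NumPy sketch of that standard objective; the function name, embedding shapes, and `temperature=0.07` default are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Rows of image_emb and text_emb at the same index form a positive pair;
    all other rows in the batch act as in-batch negatives.
    (Illustrative sketch of the standard CLIP objective, not KELIP's code.)
    """
    # Normalize embeddings so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs sit on the diagonal
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Numerically stable log-softmax, then pick the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

In the bilingual setting, Korean and English captions would each be paired with their images under this same objective, which is how relations between the two languages can emerge through shared visual semantics rather than explicit cross-lingual supervision.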
