Paper Title

Large-scale Bilingual Language-Image Contrastive Learning

Authors

Byungsoo Ko, Geonmo Gu

Abstract

This paper is a technical report sharing our experience and findings from building a Korean and English bilingual multimodal model. While most multimodal datasets focus on English and multilingual multimodal research relies on machine-translated texts, such machine-translated texts fall short in describing unique expressions, cultural information, and proper nouns in languages other than English. In this work, we collect 1.1 billion image-text pairs (708 million Korean and 476 million English) and train a bilingual multimodal model named KELIP. We introduce simple yet effective training schemes, including MAE pre-training and multi-crop augmentation. Extensive experiments demonstrate that a model trained with these schemes shows competitive performance in both languages. Moreover, we discuss multimodal-related research questions: 1) strong augmentation-based methods can distract the model from learning proper multimodal relations; 2) a multimodal model trained without explicit cross-lingual relations can still learn such relations via visual semantics; 3) our bilingual KELIP can capture cultural differences in the visual semantics of words with the same meaning; 4) a large-scale multimodal model can be used for multimodal feature analogy. We hope that this work will provide helpful experience and findings for future research. We provide an open-source pre-trained KELIP.
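The language-image contrastive objective named in the title is the CLIP-style symmetric InfoNCE loss: in a batch of matched image-text pairs, each image's positive is its own caption and every other caption is a negative, and vice versa. Below is a minimal NumPy sketch of that standard objective; the function name, embedding shapes, and `temperature=0.07` default are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of matched image-text pairs.

    Rows of image_emb and text_emb at the same index form a positive pair;
    all other rows in the batch act as in-batch negatives.
    (Illustrative sketch of the standard CLIP objective, not KELIP's code.)
    """
    # Normalize embeddings so dot products become cosine similarities
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # (N, N) similarity matrix; matched pairs sit on the diagonal
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]

    def cross_entropy(l):
        # Numerically stable log-softmax, then pick the diagonal targets
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(n), np.arange(n)].mean()

    # Average the image->text and text->image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

In the bilingual setting, Korean and English captions would each be paired with their images under this same objective, which is how relations between the two languages can emerge through shared visual semantics rather than explicit cross-lingual supervision.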
