Paper Title
Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training
Paper Authors
Paper Abstract
In this paper, we introduce Cross-View Language Modeling, a simple and effective pre-training framework that unifies cross-lingual and cross-modal pre-training with shared architectures and objectives. Our approach is motivated by a key observation that cross-lingual and cross-modal pre-training share the same goal of aligning two different views of the same object into a common semantic space. To this end, the cross-view language modeling framework considers both multi-modal data (i.e., image-caption pairs) and multi-lingual data (i.e., parallel sentence pairs) as two different views of the same object, and trains the model to align the two views by maximizing the mutual information between them with conditional masked language modeling and contrastive learning. We pre-train CCLM, a Cross-lingual Cross-modal Language Model, with the cross-view language modeling framework. Empirical results on IGLUE, a multi-lingual multi-modal benchmark, and two multi-lingual image-text retrieval datasets show that while conceptually simpler, CCLM significantly outperforms the prior state-of-the-art with an average absolute improvement of over 10%. Moreover, CCLM is the first multi-lingual multi-modal pre-trained model that surpasses the translate-test performance of representative English vision-language models by zero-shot cross-lingual transfer.
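To make the unified objective concrete, below is a minimal sketch (not the authors' released code) of the cross-view contrastive term described above: a symmetric InfoNCE loss that treats an image-caption pair or a parallel sentence pair as two views of the same object and pulls matching views together in a shared embedding space. The function name, embedding dimension, and temperature are illustrative assumptions; the conditional masked language modeling term is omitted.

```python
# A hedged sketch of the cross-view contrastive objective, assuming both
# views have already been encoded into fixed-size embeddings. All names
# (cross_view_contrastive_loss, dim=256, temperature=0.07) are illustrative.

import torch
import torch.nn.functional as F

def cross_view_contrastive_loss(view_a: torch.Tensor,
                                view_b: torch.Tensor,
                                temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired view embeddings.

    view_a, view_b: [batch, dim] embeddings of two views of the same objects
    (e.g., image/caption, or source sentence/translation). Maximizing the
    diagonal similarity against in-batch negatives is a standard lower-bound
    surrogate for maximizing the mutual information between the views.
    """
    a = F.normalize(view_a, dim=-1)
    b = F.normalize(view_b, dim=-1)
    logits = a @ b.t() / temperature  # [batch, batch] pairwise similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Matching pairs lie on the diagonal; other batch items act as negatives.
    loss_a2b = F.cross_entropy(logits, targets)
    loss_b2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2b + loss_b2a)

if __name__ == "__main__":
    # Toy usage: the same loss applies unchanged to multi-modal batches
    # (image vs. caption embeddings) and multi-lingual batches
    # (source-sentence vs. target-sentence embeddings).
    img_or_src = torch.randn(8, 256)
    cap_or_tgt = torch.randn(8, 256)
    print(cross_view_contrastive_loss(img_or_src, cap_or_tgt).item())
```

The point of the sketch is that nothing in the loss distinguishes modalities from languages: sharing one alignment objective across both data types is what lets the framework unify cross-lingual and cross-modal pre-training under a single architecture.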