Paper Title
Beyond Contrastive Learning: A Variational Generative Model for Multilingual Retrieval
Paper Authors
Paper Abstract
Contrastive learning has been successfully used for retrieval of semantically aligned sentences, but it often requires large batch sizes or careful engineering to work well. In this paper, we instead propose a generative model for learning multilingual text embeddings which can be used to retrieve or score sentence pairs. Our model operates on parallel data in $N$ languages and, through an approximation we introduce, efficiently encourages source separation in this multilingual setting, separating semantic information that is shared between translations from stylistic or language-specific variation. We present a careful, large-scale comparison between contrastive and generation-based approaches for learning multilingual text embeddings, a comparison that, to the best of our knowledge, has not been done despite the popularity of these approaches. We evaluate this method on a suite of tasks including semantic similarity, bitext mining, and cross-lingual question retrieval -- the last of which we introduce in this paper. Overall, our Variational Multilingual Source-Separation Transformer (VMSST) model outperforms both strong contrastive and generative baselines on these tasks.
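To make the source-separation idea concrete, below is a minimal sketch of a variational objective of this flavor: each translation in a parallel tuple is encoded into a shared semantic latent plus a language-specific latent, a decoder reconstructs each sentence from both latents, and KL terms regularize the latents toward a standard normal. This is an illustrative toy under stated assumptions, not the paper's VMSST implementation; the names (`SourceSeparationEncoder`, `vmsst_style_loss`), the per-language heads, the dimensions, and the bag-of-words decoder are all assumptions made for brevity.

```python
# Illustrative sketch (NOT the authors' implementation) of a variational
# source-separation objective over parallel sentences in N languages.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SourceSeparationEncoder(nn.Module):
    """Maps a pooled sentence vector to Gaussian parameters for a shared
    semantic latent and a language-specific latent (both assumptions)."""
    def __init__(self, hidden_dim: int, latent_dim: int, num_languages: int):
        super().__init__()
        self.sem_head = nn.Linear(hidden_dim, 2 * latent_dim)  # mu, logvar
        # One head per language -- a simplification for this sketch.
        self.lang_heads = nn.ModuleList(
            [nn.Linear(hidden_dim, 2 * latent_dim) for _ in range(num_languages)]
        )

    def forward(self, h: torch.Tensor, lang_id: int):
        sem_mu, sem_logvar = self.sem_head(h).chunk(2, dim=-1)
        lang_mu, lang_logvar = self.lang_heads[lang_id](h).chunk(2, dim=-1)
        return (sem_mu, sem_logvar), (lang_mu, lang_logvar)

def reparameterize(mu, logvar):
    # Standard reparameterization trick: z = mu + sigma * eps.
    return mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimension.
    return 0.5 * (mu.pow(2) + logvar.exp() - 1.0 - logvar).sum(-1)

def vmsst_style_loss(encoder, decoder, pooled, targets, lang_ids, beta=1.0):
    """ELBO-style loss over a batch of parallel sentences.

    pooled:   (N, B, H) pooled encoder states, one slice per language
    targets:  (N, B, V) toy bag-of-words reconstruction targets
    lang_ids: length-N list of language indices
    """
    total = 0.0
    for i, lang in enumerate(lang_ids):
        (sm, sv), (lm, lv) = encoder(pooled[i], lang)
        z = torch.cat([reparameterize(sm, sv), reparameterize(lm, lv)], dim=-1)
        recon = F.binary_cross_entropy_with_logits(
            decoder(z), targets[i], reduction="none").sum(-1)
        kl = kl_to_standard_normal(sm, sv) + kl_to_standard_normal(lm, lv)
        total = total + (recon + beta * kl).mean()
    return total / len(lang_ids)

if __name__ == "__main__":
    N, B, H, D, V = 2, 4, 32, 16, 100  # languages, batch, hidden, latent, vocab
    enc = SourceSeparationEncoder(H, D, num_languages=N)
    dec = nn.Linear(2 * D, V)          # toy bag-of-words decoder
    pooled = torch.randn(N, B, H)
    targets = torch.bernoulli(torch.full((N, B, V), 0.1))
    print(vmsst_style_loss(enc, dec, pooled, targets, lang_ids=[0, 1]).item())
```

In this setup, only the semantic latent would be used as the retrieval embedding, since the language-specific latent is meant to absorb stylistic and language-dependent variation; unlike a contrastive objective, nothing here requires large batches of in-batch negatives.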