Paper Title
Unsupervised Context Aware Sentence Representation Pretraining for Multi-lingual Dense Retrieval
Paper Authors
Paper Abstract
Recent research demonstrates the effectiveness of using pretrained language models (PLMs) to improve dense retrieval and multilingual dense retrieval. In this work, we present a simple but effective monolingual pretraining task called contrastive context prediction (CCP), which learns sentence representations by modeling sentence-level contextual relations. By pulling the embeddings of sentences within a local context closer together and pushing random negative samples away, different languages can form isomorphic structures, so that sentence pairs in two different languages are automatically aligned. Our experiments show that model collapse and information leakage occur easily during contrastive training of language models, but that a language-specific memory bank and an asymmetric batch normalization operation play essential roles in preventing collapse and information leakage, respectively. In addition, post-processing of sentence embeddings is also highly effective for improving retrieval performance. On the multilingual sentence retrieval task Tatoeba, our model achieves new SOTA results among methods that do not use bilingual data. Our model also shows larger gains on Tatoeba when transferring between non-English pairs. On two multilingual query-passage retrieval tasks, XOR Retrieve and Mr.TYDI, our model even achieves two SOTA results, in both the zero-shot and the supervised setting, among all pretraining models using bilingual data.
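The abstract describes the training signal only at a high level. The sketch below is a minimal PyTorch illustration of how a contrastive context prediction objective with a language-specific memory bank and asymmetric batch normalization could be wired together; the names (MemoryBank, ccp_loss), the embedding dimension, bank size, temperature, and the exact placement of batch normalization are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch only: all hyperparameters and names are assumptions,
# not taken from the paper's code.
import torch
import torch.nn.functional as F


class MemoryBank:
    """Language-specific queue of past sentence embeddings used as negatives.
    One bank would be kept per language so negatives stay monolingual."""

    def __init__(self, size: int = 4096, dim: int = 768):
        self.bank = F.normalize(torch.randn(size, dim), dim=-1)
        self.ptr = 0

    def negatives(self) -> torch.Tensor:
        # Clone so later in-place updates do not interfere with autograd.
        return self.bank.clone()

    def enqueue(self, emb: torch.Tensor) -> None:
        n = emb.size(0)
        idx = torch.arange(self.ptr, self.ptr + n) % self.bank.size(0)
        self.bank[idx] = F.normalize(emb.detach(), dim=-1)
        self.ptr = int((self.ptr + n) % self.bank.size(0))


def ccp_loss(anchor: torch.Tensor,
             context: torch.Tensor,
             bank: MemoryBank,
             bn: torch.nn.BatchNorm1d,
             temperature: float = 0.05) -> torch.Tensor:
    """InfoNCE-style loss: pull a sentence toward a neighboring (in-context)
    sentence and push it away from random negatives drawn from the
    language-specific memory bank. Batch normalization is applied to one side
    only ("asymmetric"), which the abstract credits with preventing
    information leakage between the two sides of a pair."""
    anchor = F.normalize(bn(anchor), dim=-1)           # BN on the anchor side only
    context = F.normalize(context, dim=-1)             # no BN on the context side
    negatives = bank.negatives()                       # (K, dim)

    pos = (anchor * context).sum(-1, keepdim=True)     # (B, 1) positive similarity
    neg = anchor @ negatives.t()                       # (B, K) negative similarities
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(anchor.size(0), dtype=torch.long)  # positive sits at index 0
    return F.cross_entropy(logits, labels)


if __name__ == "__main__":
    bn = torch.nn.BatchNorm1d(768)
    bank = MemoryBank()
    a, c = torch.randn(8, 768), torch.randn(8, 768)    # stand-ins for PLM sentence embeddings
    loss = ccp_loss(a, c, bank, bn)
    bank.enqueue(c)                                    # refresh this language's negatives
    print(loss.item())
```

Normalizing only the anchor side mirrors the abstract's claim that an asymmetric batch normalization prevents information leakage within a pair, and keeping a separate bank per language keeps negatives monolingual, which the abstract identifies as essential for avoiding collapse. The post-processing of sentence embeddings mentioned in the abstract is not specified there and is therefore not sketched here.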