Paper Title


LexMAE: Lexicon-Bottlenecked Pretraining for Large-Scale Retrieval

Paper Authors

Tao Shen, Xiubo Geng, Chongyang Tao, Can Xu, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang

Paper Abstract


In large-scale retrieval, the lexicon-weighting paradigm, which learns weighted sparse representations in vocabulary space, has shown promising results with high quality and low latency. Although it deeply exploits the lexicon-representing capability of pre-trained language models, a crucial gap remains between language modeling and lexicon-weighting retrieval -- the former prefers certain or low-entropy words, whereas the latter favors pivot or high-entropy words -- which becomes the main barrier to lexicon-weighting performance in large-scale retrieval. To bridge this gap, we propose a brand-new pre-training framework, the lexicon-bottlenecked masked autoencoder (LexMAE), to learn importance-aware lexicon representations. Essentially, we place a lexicon-bottlenecked module between a normal language-modeling encoder and a weakened decoder, where a continuous bag-of-words bottleneck is constructed to learn a lexicon-importance distribution in an unsupervised fashion. The pre-trained LexMAE is readily transferred to lexicon-weighting retrieval via fine-tuning. On the ad-hoc retrieval benchmark MS-Marco, it achieves 42.6% MRR@10 with 45.8 QPS on the passage dataset and 44.4% MRR@100 with 134.8 QPS on the document dataset, using a CPU machine. LexMAE also shows state-of-the-art zero-shot transfer capability on the BEIR benchmark with 12 datasets.
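The core mechanism described in the abstract -- an encoder that emits a vocabulary-sized lexicon-importance distribution, a continuous bag-of-words bottleneck built from that distribution, and a deliberately weakened decoder that must reconstruct the masked input through it -- can be illustrated with a short PyTorch sketch. The code below is not the authors' implementation; every module and parameter name (`LexiconBottleneckMAE`, `weak_decoder`, the layer counts and sizes, and the way the bottleneck vector is fed to the decoder) is an illustrative assumption.

```python
# Minimal sketch (not the authors' code) of the lexicon-bottlenecked
# masked-autoencoder idea described in the abstract, using plain PyTorch.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LexiconBottleneckMAE(nn.Module):
    def __init__(self, vocab_size=30522, d_model=256, n_enc_layers=6, n_dec_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=n_enc_layers)
        # MLM-style head mapping hidden states to vocabulary ("lexicon") logits.
        self.lexicon_head = nn.Linear(d_model, vocab_size)
        # Weakened decoder: deliberately shallow, so reconstruction must rely
        # on the lexicon bottleneck rather than on decoder capacity.
        dec_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.weak_decoder = nn.TransformerEncoder(dec_layer, num_layers=n_dec_layers)
        self.decoder_head = nn.Linear(d_model, vocab_size)

    def forward(self, enc_input_ids, dec_input_ids):
        # 1) Encode the (masked) passage and produce token-level vocab logits.
        h = self.encoder(self.embed(enc_input_ids))           # [B, L, D]
        token_logits = self.lexicon_head(h)                   # [B, L, V]

        # 2) Pool over the sequence and normalize: a passage-level
        #    lexicon-importance distribution over the whole vocabulary.
        lexicon_logits = token_logits.max(dim=1).values       # [B, V]
        importance = F.softmax(lexicon_logits, dim=-1)        # [B, V]

        # 3) Continuous bag-of-words bottleneck: the importance-weighted sum
        #    of word embeddings is the only passage summary the decoder sees.
        bottleneck = importance @ self.embed.weight           # [B, D]

        # 4) The weak decoder reconstructs masked tokens, conditioned on the
        #    bottleneck by prepending it to the decoder input sequence.
        dec_emb = self.embed(dec_input_ids)                   # [B, L, D]
        dec_in = torch.cat([bottleneck.unsqueeze(1), dec_emb], dim=1)
        dec_out = self.weak_decoder(dec_in)[:, 1:]            # drop bottleneck slot
        return token_logits, self.decoder_head(dec_out)       # MLM + reconstruction logits


# Tiny smoke test with random token ids.
model = LexiconBottleneckMAE()
enc_ids = torch.randint(0, 30522, (2, 16))
dec_ids = torch.randint(0, 30522, (2, 16))
enc_logits, dec_logits = model(enc_ids, dec_ids)
print(enc_logits.shape, dec_logits.shape)  # [2, 16, 30522] for both
```

In this reading of the abstract, the normalized lexicon logits play the role of the importance-aware lexicon representation that would later be fine-tuned for lexicon-weighting retrieval; the shallow decoder exists only to force those logits to carry enough information to reconstruct the passage.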
