CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion

Paper title

CAPSTONE: Curriculum Sampling for Dense Retrieval with Document Expansion

Authors

Xingwei He, Yeyun Gong, A-Long Jin, Hang Zhang, Anlei Dong, Jian Jiao, Siu Ming Yiu, Nan Duan

Abstract

The dual-encoder has become the de facto architecture for dense retrieval. Typically, it computes the latent representations of the query and document independently, and therefore fails to fully capture the interactions between them. To alleviate this, recent research has focused on obtaining query-informed document representations. During training, such methods expand the document with a real query, but during inference they replace the real query with a generated one. This inconsistency between training and inference causes the dense retrieval model to prioritize query information while disregarding the document when computing the document representation. Consequently, the model performs even worse than the vanilla dense retrieval model, because its performance relies heavily on the relevance between the generated queries and the real query. In this paper, we propose a curriculum sampling strategy that utilizes pseudo queries during training and progressively enhances the relevance between the generated query and the real query. By doing so, the retrieval model learns to extend its attention from the document alone to both the document and query, resulting in high-quality query-informed document representations. Experimental results on both in-domain and out-of-domain datasets demonstrate that our approach outperforms previous dense retrieval models.
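The curriculum idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear schedule, the sliding window, the relevance scores, and the `[SEP]` concatenation order are all assumptions made for illustration. The sketch picks a pseudo query whose relevance to the real query increases as training progresses, then uses it to expand the document.

```python
import random


def curriculum_sample(pseudo_queries, relevance, step, total_steps,
                      window=1, rng=random):
    """Pick one pseudo query; later steps favor higher relevance.

    Hypothetical helper: `relevance[i]` is an assumed score of how close
    pseudo_queries[i] is to the real query (higher means closer).
    """
    if not pseudo_queries or len(pseudo_queries) != len(relevance):
        raise ValueError("need one relevance score per pseudo query")

    # Rank candidates from least to most relevant to the real query.
    ranked = [q for _, q in sorted(zip(relevance, pseudo_queries),
                                   key=lambda pair: pair[0])]

    # Assumed linear schedule: training progress clamped to [0, 1].
    progress = min(max(step / max(total_steps, 1), 0.0), 1.0)

    # The window center slides from the least to the most relevant query.
    center = round(progress * (len(ranked) - 1))
    lo = max(0, center - window)
    hi = min(len(ranked), center + window + 1)
    return rng.choice(ranked[lo:hi])


def expand_document(document, query, sep=" [SEP] "):
    """Concatenate a query with the document (order and separator assumed)."""
    return query + sep + document


if __name__ == "__main__":
    queries = ["a", "b", "c", "d", "e"]
    scores = [0.9, 0.1, 0.5, 0.3, 0.7]
    for step in (0, 50, 100):
        q = curriculum_sample(queries, scores, step, 100, window=0)
        print(step, expand_document("some passage text", q))
```

Early in training the sampled queries are weakly related to the real query, so the encoder has to rely on the document itself. As the schedule advances, the expansion becomes more informative, which mirrors the shift of attention from the document alone to both document and query described above.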
