Paper Title


DuReader_retrieval: A Large-scale Chinese Benchmark for Passage Retrieval from Web Search Engine

Authors

Yifu Qiu, Hongyu Li, Yingqi Qu, Ying Chen, Qiaoqiao She, Jing Liu, Hua Wu, Haifeng Wang

Abstract


In this paper, we present DuReader_retrieval, a large-scale Chinese dataset for passage retrieval. DuReader_retrieval contains more than 90K queries and over 8M unique passages from a commercial search engine. To alleviate the shortcomings of other datasets and ensure the quality of our benchmark, we (1) reduce the false negatives in the development and test sets by manually annotating results pooled from multiple retrievers, and (2) remove the training queries that are semantically similar to the development and test queries. Additionally, we provide two out-of-domain test sets for cross-domain evaluation, as well as a set of human-translated queries for cross-lingual retrieval evaluation. The experiments demonstrate that DuReader_retrieval is challenging and a number of problems remain unsolved, such as the salient phrase mismatch and the syntactic mismatch between queries and passages. These experiments also show that dense retrievers do not generalize well across domains, and that cross-lingual retrieval is inherently challenging. DuReader_retrieval is publicly available at https://github.com/baidu/DuReader/tree/master/DuReader-Retrieval.
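Step (2) above — removing training queries that are semantically similar to held-out queries — can be sketched as an embed-and-threshold filter. The sketch below is illustrative only: `embed` here is a toy character-bigram stand-in (the paper's actual similarity model is not specified), and `filter_training_queries` and the `threshold` value are hypothetical names, not the authors' code.

```python
from collections import Counter
from math import sqrt

def embed(text):
    # Toy stand-in for a real dense encoder: character-bigram counts.
    return Counter(text[i:i + 2] for i in range(len(text) - 1))

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(cnt * b[gram] for gram, cnt in a.items() if gram in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def filter_training_queries(train_queries, heldout_queries, threshold=0.8):
    """Keep only training queries below the similarity threshold
    against every development/test query."""
    heldout_vecs = [embed(q) for q in heldout_queries]
    return [
        q for q in train_queries
        if all(cosine(embed(q), h) < threshold for h in heldout_vecs)
    ]
```

With a real dense encoder in place of `embed`, the same structure drops near-duplicates of evaluation queries from the training set, which is what prevents train/test leakage in the benchmark.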
