Paper Title
Noise-Robust De-Duplication at Scale
Paper Authors
Paper Abstract
Identifying near duplicates within large, noisy text corpora has a myriad of applications that range from de-duplicating training datasets, reducing privacy risk, and evaluating test set leakage, to identifying reproduced news articles and literature within large corpora. Across these diverse applications, the overwhelming majority of work relies on N-grams. Limited efforts have been made to evaluate how well N-gram methods perform, in part because it is unclear how one could create an unbiased evaluation dataset for a massive corpus. This study uses the unique timeliness of historical news wires to create a 27,210 document dataset, with 122,876 positive duplicate pairs, for studying noise-robust de-duplication. The time-sensitivity of news makes comprehensive hand labelling feasible - despite the massive overall size of the corpus - as duplicates occur within a narrow date range. The study then develops and evaluates a range of de-duplication methods: hashing and N-gram overlap (which predominate in the literature), a contrastively trained bi-encoder, and a re-rank style approach combining a bi- and cross-encoder. The neural approaches significantly outperform hashing and N-gram overlap. We show that the bi-encoder scales well, de-duplicating a 10 million article corpus on a single GPU card in a matter of hours. We also apply our pre-trained model to the RealNews and patent portions of C4 (Colossal Clean Crawled Corpus), illustrating that a neural approach can identify many near duplicates missed by hashing, in the presence of various types of noise. The public release of our NEWS-COPY de-duplication dataset, codebase, and the pre-trained models will facilitate further research and applications.
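As a rough illustration of the contrast the abstract draws, the sketch below compares a word N-gram Jaccard baseline with an off-the-shelf bi-encoder. It is a minimal sketch under stated assumptions, not the paper's released NEWS-COPY model or codebase: the model name "all-MiniLM-L6-v2", the 3-gram size, and any similarity threshold a user might apply are placeholders chosen for illustration.

```python
# Minimal sketch: N-gram overlap vs. a bi-encoder for noisy near-duplicate detection.
# Assumptions: "all-MiniLM-L6-v2" is a generic stand-in model, not the paper's
# pre-trained NEWS-COPY bi-encoder; thresholds would need tuning on labelled pairs.
from sentence_transformers import SentenceTransformer


def ngram_jaccard(a: str, b: str, n: int = 3) -> float:
    """Jaccard overlap of word N-grams, the signal behind most hashing baselines."""
    def grams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / max(len(ga | gb), 1)


def bi_encoder_similarities(docs, model_name: str = "all-MiniLM-L6-v2"):
    """Embed each document once, then compare with cosine similarity.
    Pairs above a tuned threshold would be flagged as near duplicates."""
    model = SentenceTransformer(model_name)
    emb = model.encode(docs, convert_to_numpy=True, normalize_embeddings=True)
    return emb @ emb.T  # embeddings are unit-norm, so dot product = cosine


if __name__ == "__main__":
    docs = [
        "The senate passed the bill on Tuesday after a lengthy debate.",
        "After a lengthy debate, the bill wsa passed by the senate Tuesday.",  # noisy near duplicate
        "Stock markets rallied as investors shrugged off inflation fears.",
    ]
    print("3-gram Jaccard (0 vs 1):", round(ngram_jaccard(docs[0], docs[1]), 3))
    sims = bi_encoder_similarities(docs)
    print("bi-encoder cosine (0 vs 1):", round(float(sims[0, 1]), 3))
    print("bi-encoder cosine (0 vs 2):", round(float(sims[0, 2]), 3))
```

In this toy example, OCR-style noise and reordering depress the N-gram overlap between the first two documents while the embedding similarity stays high, which is the behaviour the abstract attributes to neural methods; at the scale the paper reports (10 million articles), one would pair the embeddings with an approximate nearest-neighbour index rather than a dense similarity matrix.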