Paper Title

Multilingual Search with Subword TF-IDF

Authors

Wangperawong, Artit

Abstract

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as part of Text2Text: https://github.com/artitw/text2text
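To make the idea concrete, the following is a minimal sketch of TF-IDF retrieval over subword units rather than words. It is not the Text2Text implementation and does not use its API. The abstract describes a trained subword tokenization model; here overlapping character trigrams stand in for that model as a simplifying assumption. The function names (`subword_tokens`, `build_index`, `search`), the smoothed IDF formula, and the cosine scoring are illustrative choices, not details taken from the paper.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in subword tokenizer (assumption): overlapping character n-grams.

    No stop-word list, stemmer, or language-specific rule is applied;
    whitespace runs are collapsed to '_' so word boundaries stay visible.
    """
    s = "_" + "_".join(text.lower().split()) + "_"
    if len(s) <= n:
        return [s]
    return [s[i:i + n] for i in range(len(s) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {t: v / norm for t, v in vec.items()} if norm else vec


def build_index(docs, n=3):
    """Return (idf, vectors): IDF weights and unit-length TF-IDF vectors."""
    counts_per_doc = [Counter(subword_tokens(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    num_docs = len(docs)
    # Smoothed IDF (an illustrative choice, not taken from the paper).
    idf = {t: math.log((1 + num_docs) / (1 + df)) + 1.0
           for t, df in doc_freq.items()}
    vectors = [normalize({t: tf * idf[t] for t, tf in counts.items()})
               for counts in counts_per_doc]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Return (best_doc_index, scores) using cosine similarity."""
    counts = Counter(subword_tokens(query, n))
    # Subwords never seen in the corpus carry no weight.
    q = normalize({t: tf * idf[t] for t, tf in counts.items() if t in idf})
    scores = [sum(w * vec.get(t, 0.0) for t, w in q.items())
              for vec in vectors]
    best = max(range(len(vectors)), key=scores.__getitem__)
    return best, scores


# Usage: one index serves documents in different languages.
docs = [
    "The cat sat on the mat.",
    "Stock markets fell sharply today.",
    "El gato duerme en la casa.",
]
idf, vectors = build_index(docs)
best_en, _ = search("cats sitting on mats", idf, vectors)
best_es, _ = search("gatos en casa", idf, vectors)
```

Because matching happens on subword fragments, inflected forms such as "cats" or "gatos" still overlap with "cat" and "gato" without a stemmer, which is the property the abstract attributes to STF-IDF. A trained subword model would replace `subword_tokens` in a closer reproduction.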
