Multilingual Search with Subword TF-IDF

Paper Title

Multilingual Search with Subword TF-IDF

Paper Author

Wangperawong, Artit

Paper Abstract

Multilingual search can be achieved with subword tokenization. The accuracy of traditional TF-IDF approaches depends on manually curated tokenization, stop words, and stemming rules, whereas subword TF-IDF (STF-IDF) can offer higher accuracy without such heuristics. Moreover, multilingual support can be incorporated inherently as part of the subword tokenization model training. XQuAD evaluation demonstrates the advantages of STF-IDF: superior information retrieval accuracy of 85.4% for English and over 80% for 10 other languages, without any heuristics-based preprocessing. The software to reproduce these results is open-sourced as part of Text2Text: https://github.com/artitw/text2text
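
To make the idea in the abstract concrete, the following is a minimal sketch of TF-IDF retrieval over subword units, not the Text2Text implementation. It assumes overlapping character trigrams as a stand-in for the trained subword tokenization model the paper describes, and it uses a smoothed IDF with cosine similarity. The function names (`subword_tokens`, `build_index`, `search`) are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def subword_tokens(text, n=3):
    """Stand-in subword tokenizer (assumption): overlapping character n-grams.

    The paper relies on a trained subword tokenization model; character
    n-grams are used here only so the sketch needs no external model.
    """
    padded = f" {text.lower().strip()} "
    if len(padded) < n:
        return [padded]
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def normalize(vec):
    """Scale a sparse vector (dict) to unit L2 norm."""
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {tok: w / norm for tok, w in vec.items()} if norm else vec


def build_index(docs, n=3):
    """Return (idf, vectors): IDF weights and one unit TF-IDF vector per doc."""
    counts_per_doc = [Counter(subword_tokens(d, n)) for d in docs]
    doc_freq = Counter()
    for counts in counts_per_doc:
        doc_freq.update(counts.keys())
    num_docs = len(docs)
    # Smoothed IDF so that tokens present in every document keep some weight.
    idf = {tok: math.log((1 + num_docs) / (1 + df)) + 1.0
           for tok, df in doc_freq.items()}
    vectors = [normalize({tok: tf * idf[tok] for tok, tf in counts.items()})
               for counts in counts_per_doc]
    return idf, vectors


def search(query, idf, vectors, n=3):
    """Return document indices ranked by cosine similarity to the query."""
    counts = Counter(subword_tokens(query, n))
    q = normalize({tok: tf * idf[tok]
                   for tok, tf in counts.items() if tok in idf})
    scores = [sum(w * doc.get(tok, 0.0) for tok, w in q.items())
              for doc in vectors]
    return sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
```

Because the tokens are language-agnostic fragments, the same index can hold documents in several languages with no stop-word list or stemmer: calling `build_index` on a mixed-language list and then `search` with a query in any of those languages ranks documents by shared subword fragments.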
