Paper Title

Language ID in the Wild: Unexpected Challenges on the Path to a Thousand-Language Web Text Corpus

Authors

Isaac Caswell, Theresa Breiner, Daan van Esch, Ankur Bapna

Abstract

Large text corpora are increasingly important for a wide variety of Natural Language Processing (NLP) tasks, and automatic language identification (LangID) is a core technology needed to collect such datasets in a multilingual context. LangID is largely treated as solved in the literature, with models reported that achieve over 90% average F1 on as many as 1,366 languages. We train LangID models on up to 1,629 languages with comparable quality on held-out test sets, but find that human-judged LangID accuracy for web-crawl text corpora created using these models is only around 5% for many lower-resource languages, suggesting a need for more robust evaluation. Further analysis revealed a variety of error modes, arising from domain mismatch, class imbalance, language similarity, and insufficiently expressive models. We propose two classes of techniques to mitigate these errors: wordlist-based tunable-precision filters (for which we release curated lists in about 500 languages) and transformer-based semi-supervised LangID models, which increase median dataset precision from 5.5% to 71.2%. These techniques enable us to create an initial data set covering 100K or more relatively clean sentences in each of 500+ languages, paving the way towards a 1,000-language web text corpus.
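The wordlist-based tunable-precision filters mentioned in the abstract can be illustrated with a minimal sketch: keep a candidate sentence for a language only if a sufficiently large fraction of its tokens appear in a curated wordlist for that language. The function name, tokenization, and threshold below are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of a wordlist-based tunable-precision filter.
# Assumption: a curated wordlist per language (the paper releases such
# lists for ~500 languages); the threshold here is illustrative.

def wordlist_filter(sentences, wordlist, min_in_vocab_ratio=0.5):
    """Keep sentences whose share of in-wordlist tokens meets the threshold.

    Raising min_in_vocab_ratio trades recall for precision, which is the
    "tunable precision" aspect referred to in the abstract.
    """
    kept = []
    for sentence in sentences:
        tokens = sentence.lower().split()
        if not tokens:
            continue  # skip empty lines
        in_vocab = sum(1 for t in tokens if t in wordlist)
        if in_vocab / len(tokens) >= min_in_vocab_ratio:
            kept.append(sentence)
    return kept


# Toy usage with a tiny illustrative "wordlist":
wordlist = {"the", "cat", "sat", "on", "mat"}
corpus = ["the cat sat on the mat", "xyzzy qwerty asdf"]
print(wordlist_filter(corpus, wordlist))  # only the first sentence survives
```

Such a filter would run downstream of a LangID model, discarding sentences the model mislabeled; tightening the threshold is one simple way to push dataset precision up at the cost of coverage.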
