AfroLID: A Neural Language Identification Tool for African Languages

Paper Title

AfroLID: A Neural Language Identification Tool for African Languages

Authors

Ife Adebara, AbdelRahim Elmadany, Muhammad Abdul-Mageed, Alcides Alcoba Inciarte

Abstract

Language identification (LID) is a crucial precursor for NLP, especially for mining web data. Problematically, most of the world's 7000+ languages are not covered by today's LID technologies. We address this pressing issue for Africa by introducing AfroLID, a neural LID toolkit for 517 African languages and varieties. AfroLID exploits a manually curated, multi-domain web dataset that spans 14 language families and five orthographic systems. When evaluated on our blind Test set, AfroLID achieves an F_1 score of 95.89. We also compare AfroLID to five existing LID tools that each cover a small number of African languages, finding that it outperforms them on most languages. We further show the utility of AfroLID in the wild by testing it on the acutely under-served Twitter domain. Finally, we offer a number of controlled case studies and perform a linguistically motivated error analysis that together showcase both AfroLID's powerful capabilities and its limitations.
