Paper Title
Annotating Norwegian Language Varieties on Twitter for Part-of-Speech
Authors
Abstract
Norwegian Twitter data poses an interesting challenge for Natural Language Processing (NLP) tasks. These texts are difficult for models trained on standardized text in one of the two Norwegian written forms (Bokmål and Nynorsk), as they contain both the typical variation of social media text and a large amount of dialectal variety. In this paper, we present a novel Norwegian Twitter dataset annotated with part-of-speech (POS) tags. We show that models trained on Universal Dependencies (UD) data perform worse when evaluated on this dataset, and that models trained on Bokmål generally perform better than those trained on Nynorsk. We also see that, for some models, performance on dialectal tweets is comparable to performance on the written standards. Finally, we perform a detailed analysis of the errors that models commonly make on this data.