Paper Title
Are All Languages Created Equal in Multilingual BERT?
Paper Authors
Paper Abstract
Multilingual BERT (mBERT) trained on 104 languages has shown surprisingly good cross-lingual performance on several NLP tasks, even without explicit cross-lingual signals. However, these evaluations have focused on cross-lingual transfer with high-resource languages, covering only a third of the languages covered by mBERT. We explore how mBERT performs on a much wider set of languages, focusing on the quality of representation for low-resource languages, measured by within-language performance. We consider three tasks: Named Entity Recognition (99 languages), Part-of-Speech Tagging, and Dependency Parsing (54 languages each). mBERT performs better than or comparably to baselines on high-resource languages but does much worse on low-resource languages. Furthermore, monolingual BERT models for these languages do even worse. Pairing low-resource languages with similar languages narrows the performance gap between monolingual BERT and mBERT. We find that better models for low-resource languages require more efficient pretraining techniques or more data.
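The evaluation described in the abstract fine-tunes mBERT separately for each language on token-level tasks and measures within-language performance. The following is a minimal sketch, not the authors' code, of how such a token-classification setup might be assembled with the Hugging Face transformers library; the checkpoint name bert-base-multilingual-cased is the public 104-language mBERT release, while the label list and example sentence are illustrative assumptions.

# Minimal sketch (not from the paper): wrapping mBERT for token classification,
# as one might do to fine-tune on per-language POS or NER data.
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL_NAME = "bert-base-multilingual-cased"  # public 104-language mBERT checkpoint

# Hypothetical label set for a POS-tagging run; NER would use a BIO tag scheme instead.
labels = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "DET", "ADP", "PUNCT", "X"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForTokenClassification.from_pretrained(
    MODEL_NAME, num_labels=len(labels)  # classification head is newly initialized
)

# Encode one sentence in any covered language; within-language evaluation
# fine-tunes and tests on data from the same language.
encoding = tokenizer("Dies ist ein Beispielsatz .", return_tensors="pt")
outputs = model(**encoding)
predictions = outputs.logits.argmax(dim=-1)  # one label id per wordpiece

In practice the head would be trained on labeled data for the target language before the predictions are meaningful; the sketch only shows how the pretrained encoder is reused across tasks.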