Paper Title

Mind Your Inflections! Improving NLP for Non-Standard Englishes with Base-Inflection Encoding

Paper Authors

Samson Tan, Shafiq Joty, Lav R. Varshney, Min-Yen Kan

Paper Abstract

Inflectional variation is a common feature of World Englishes such as Colloquial Singapore English and African American Vernacular English. Although comprehension by human readers is usually unimpaired by non-standard inflections, current NLP systems are not yet robust. We propose Base-Inflection Encoding (BITE), a method to tokenize English text by reducing inflected words to their base forms before reinjecting the grammatical information as special symbols. Fine-tuning pretrained NLP models for downstream tasks using our encoding defends against inflectional adversaries while maintaining performance on clean data. Models using BITE generalize better to dialects with non-standard inflections without explicit training and translation models converge faster when trained with BITE. Finally, we show that our encoding improves the vocabulary efficiency of popular data-driven subword tokenizers. Since there has been no prior work on quantitatively evaluating vocabulary efficiency, we propose metrics to do so.
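
To make the encoding concrete, here is a minimal sketch of the base-inflection idea described in the abstract: reduce each inflected word to its base form, then reinject the grammatical information as a special symbol. It assumes NLTK's WordNet lemmatizer and Penn Treebank POS tagger as stand-ins for the paper's actual pipeline; the function name bite_encode and the bracketed tag symbols are illustrative, not the authors' implementation.

```python
# A rough sketch of Base-Inflection Encoding (BITE), using NLTK components
# as assumed stand-ins for the paper's lemmatizer and tagger.
import nltk
from nltk.stem import WordNetLemmatizer

# One-time resource downloads (uncomment on first run):
# nltk.download("punkt"); nltk.download("averaged_perceptron_tagger"); nltk.download("wordnet")

lemmatizer = WordNetLemmatizer()

def wordnet_pos(ptb_tag):
    """Map a Penn Treebank tag to the WordNet POS category the lemmatizer expects."""
    if ptb_tag.startswith("V"):
        return "v"
    if ptb_tag.startswith("N"):
        return "n"
    if ptb_tag.startswith("J"):
        return "a"
    if ptb_tag.startswith("R"):
        return "r"
    return None

def bite_encode(sentence):
    """Reduce each word to its base form; if the word was inflected,
    append its inflection as a special symbol (illustrative tag names)."""
    tokens = nltk.word_tokenize(sentence)
    encoded = []
    for word, tag in nltk.pos_tag(tokens):
        pos = wordnet_pos(tag)
        base = lemmatizer.lemmatize(word.lower(), pos) if pos else word.lower()
        encoded.append(base)
        if base != word.lower():  # word carried an inflection: keep the grammar
            encoded.append(f"[{tag}]")
    return encoded

print(bite_encode("She walks two dogs"))
# -> ['she', 'walk', '[VBZ]', 'two', 'dog', '[NNS]']
```

Because the base form and the inflection symbol are separate tokens, a downstream subword tokenizer sees the same base vocabulary whether the input is standard or non-standard English, which is the intuition behind both the robustness and the vocabulary-efficiency results the abstract reports.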
