Paper Title
Subword Segmental Language Modelling for Nguni Languages
Authors
Abstract
Subwords have become the standard units of text in NLP, enabling efficient open-vocabulary models. With algorithms like byte-pair encoding (BPE), subword segmentation is viewed as a preprocessing step applied to the corpus before training. This can lead to sub-optimal segmentations for low-resource languages with complex morphologies. We propose a subword segmental language model (SSLM) that learns how to segment words while being trained for autoregressive language modelling. By unifying subword segmentation and language modelling, our model learns subwords that optimise LM performance. We train our model on the 4 Nguni languages of South Africa. These are low-resource agglutinative languages, so subword information is critical. As an LM, SSLM outperforms existing approaches such as BPE-based models on average across the 4 languages. Furthermore, it outperforms standard subword segmenters on unsupervised morphological segmentation. We also train our model as a word-level sequence model, resulting in an unsupervised morphological segmenter that outperforms existing methods by a large margin for all 4 languages. Our results show that learning subword segmentation is an effective alternative to existing subword segmenters, enabling the model to discover morpheme-like subwords that improve its LM capabilities.
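The abstract's central idea, learning segmentation jointly with language modelling rather than fixing it with BPE before training, comes down to marginalising over every possible segmentation of a word. Below is a minimal sketch of that dynamic program; it is not the authors' released implementation, and the `subword_logprob` callable (a stand-in for a neural subword model conditioned on left context) and the `max_seg_len` cap are illustrative assumptions. Contrast this with BPE, where the corpus is split once as preprocessing and the LM never scores alternative splits.

```python
import torch

def word_logprob(chars, subword_logprob, max_seg_len=5):
    """Marginal log p(word): logsumexp over every way of splitting
    `chars` into subwords, via a forward DP in O(n * max_seg_len).

    subword_logprob(i, j) should return a scalar tensor holding the
    log-probability of the candidate subword chars[i:j] given the
    context up to position i (e.g. from a character-level decoder).
    """
    n = len(chars)
    # alpha[i] = log-marginal probability of generating the prefix chars[:i]
    alpha = [torch.tensor(0.0)]  # empty prefix: log-prob 0
    for j in range(1, n + 1):
        # every segmentation of chars[:j] ends with some subword chars[i:j]
        scores = torch.stack([
            alpha[i] + subword_logprob(i, j)
            for i in range(max(0, j - max_seg_len), j)
        ])
        alpha.append(torch.logsumexp(scores, dim=0))
    return alpha[n]


# Toy usage: a uniform per-character stand-in model, just to show the call shape.
if __name__ == "__main__":
    word = "ngiyabonga"  # isiZulu: "thank you"
    uniform = lambda i, j: torch.tensor(-2.0) * (j - i)  # hypothetical scorer
    print(word_logprob(word, uniform))
```

Because `alpha[n]` is differentiable, training can maximise this marginal likelihood directly; replacing the logsumexp with a max turns the same table into a Viterbi decoder that recovers the most probable segmentation, which is, in outline, how a model of this kind can double as an unsupervised morphological segmenter.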