Title
Evaluating Distributional Distortion in Neural Language Modeling
Authors
Abstract
A fundamental characteristic of natural language is the high rate at which speakers produce novel expressions. Because of this novelty, a heavy tail of rare events accounts for a significant amount of the total probability mass of distributions in language (Baayen, 2001). Standard language modeling metrics such as perplexity quantify the performance of language models (LMs) in aggregate. As a result, we have relatively little understanding of whether neural LMs accurately estimate the probability of sequences in this heavy tail of rare events. To address this gap, we develop a controlled evaluation scheme which uses generative models trained on natural data as artificial languages from which we can exactly compute sequence probabilities. Training LMs on generations from these artificial languages, we compare the sequence-level probability estimates given by LMs to the true probabilities in the target language. Our experiments reveal that LSTM and Transformer language models (i) systematically underestimate the probability of sequences drawn from the target language, and (ii) do so more severely for less-probable sequences. Investigating where this probability mass went, (iii) we find that LMs tend to overestimate the probability of ill-formed (perturbed) sequences. In addition, we find that this underestimation behaviour (iv) is weakened, but not eliminated, by greater amounts of training data, and (v) is exacerbated for target distributions with lower entropy.
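The evaluation scheme described in the abstract can be illustrated with a minimal, self-contained sketch. This is not the paper's actual setup (which uses LSTM/Transformer models trained on generations from neural "artificial languages"); here the target language is a hypothetical known bigram distribution, so true sequence probabilities are exactly computable, and the "trained model" is a smoothed bigram estimated from samples. All names (`TARGET`, `sample_sequence`, `log_prob`) are illustrative assumptions, not from the paper.

```python
import math
import random

random.seed(0)

# Hypothetical target "artificial language": a known bigram distribution
# P(next | prev), so the true probability of any sequence is exact.
TARGET = {
    "<s>": {"a": 0.7, "b": 0.3, "</s>": 0.0},
    "a":   {"a": 0.1, "b": 0.6, "</s>": 0.3},
    "b":   {"a": 0.5, "b": 0.1, "</s>": 0.4},
}

def sample_sequence():
    """Draw one sequence from the target language."""
    seq, prev = [], "<s>"
    while True:
        r, acc = random.random(), 0.0
        for tok, p in TARGET[prev].items():
            acc += p
            if r < acc:
                break
        if tok == "</s>":
            return seq
        seq.append(tok)
        prev = tok

def log_prob(seq, dist, eps=1e-12):
    """Sequence log-probability under a bigram distribution."""
    lp, prev = 0.0, "<s>"
    for tok in seq + ["</s>"]:
        lp += math.log(dist[prev].get(tok, 0.0) + eps)
        prev = tok
    return lp

# "Train" a model by add-one-smoothed counting over generations
# from the target language (standing in for LM training).
counts = {k: {t: 1.0 for t in v} for k, v in TARGET.items()}
for _ in range(2000):
    prev = "<s>"
    for tok in sample_sequence() + ["</s>"]:
        counts[prev][tok] += 1.0
        prev = tok
model = {k: {t: c / sum(v.values()) for t, c in v.items()}
         for k, v in counts.items()}

# Compare model estimates with exact target probabilities on fresh samples;
# a negative mean gap would indicate underestimation.
gaps = [log_prob(s, model) - log_prob(s, TARGET)
        for s in (sample_sequence() for _ in range(200))]
print(round(sum(gaps) / len(gaps), 4))
```

In the paper's actual setting the counting step is replaced by gradient-based training of a neural LM, and the target is itself a neural generative model; the comparison of model log-probabilities against exact target log-probabilities is the part this sketch mirrors.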