Paper Title
When More Data Hurts: A Troubling Quirk in Developing Broad-Coverage Natural Language Understanding Systems
Paper Authors
Paper Abstract
In natural language understanding (NLU) production systems, users' evolving needs necessitate the addition of new features over time, indexed by new symbols added to the meaning representation space. This requires additional training data and results in ever-growing datasets. We present the first systematic investigation of this incremental symbol learning scenario. Our analysis reveals a troubling quirk in building broad-coverage NLU systems: as the training dataset grows, performance on the new symbol often decreases if we do not accordingly increase its training data. This suggests that it becomes more difficult to learn new symbols with a larger training dataset. We show that this trend holds for multiple mainstream models on two common NLU tasks: intent recognition and semantic parsing. Rejecting class imbalance as the sole culprit, we reveal that the trend is closely associated with an effect we call source signal dilution, where strong lexical cues for the new symbol become diluted as the training dataset grows. Selectively dropping training examples to prevent dilution often reverses the trend, showing the over-reliance of mainstream neural NLU models on simple lexical cues. Code, models, and data are available at https://aka.ms/nlu-incremental-symbol-learning.
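To make the dilution effect concrete, below is a minimal sketch (not the authors' implementation; function names, the toy intent `set_alarm`, and the cue word are illustrative assumptions). It measures how often a lexical cue co-occurs with the new symbol's label, and shows a simple filter that drops cue-bearing examples with other labels, analogous in spirit to the selective-dropping intervention described in the abstract.

```python
def cue_precision(dataset, cue, symbol):
    """Fraction of training examples containing `cue` that are labeled `symbol`.
    As unrelated examples that happen to contain the cue are added to the
    training set, this 'source signal' for the new symbol is diluted."""
    with_cue = [ex for ex in dataset if cue in ex["text"].lower()]
    if not with_cue:
        return 0.0
    return sum(ex["label"] == symbol for ex in with_cue) / len(with_cue)

def drop_diluting_examples(dataset, cue, symbol):
    """Drop examples that contain the cue but carry a different label,
    preserving the association between the cue and the new symbol."""
    return [ex for ex in dataset
            if ex["label"] == symbol or cue not in ex["text"].lower()]

# Toy usage: a new intent 'set_alarm' cued by the word 'alarm' (hypothetical data).
data = [
    {"text": "set an alarm for 7am", "label": "set_alarm"},
    {"text": "turn off the fire alarm notification", "label": "smart_home"},
    {"text": "what's the weather", "label": "weather"},
]
print(cue_precision(data, "alarm", "set_alarm"))  # 0.5: the cue is diluted
filtered = drop_diluting_examples(data, "alarm", "set_alarm")
print(cue_precision(filtered, "alarm", "set_alarm"))  # 1.0 after filtering
```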