Paper Title
On-the-fly Denoising for Data Augmentation in Natural Language Understanding
Paper Authors
Paper Abstract
Data Augmentation (DA) is frequently used to automatically provide additional training data without extra human annotation. However, data augmentation may introduce noisy data that impairs training. To guarantee the quality of augmented data, existing methods either assume no noise exists in the augmented data and adopt consistency training, or use simple heuristics such as training loss and diversity constraints to filter out "noisy" data. However, those filtered examples may still contain useful information, and dropping them completely causes a loss of supervision signals. In this paper, based on the assumption that the original dataset is cleaner than the augmented data, we propose an on-the-fly denoising technique for data augmentation that learns from soft augmented labels provided by an organic teacher model trained on the cleaner original data. To further prevent overfitting on noisy labels, a simple self-regularization module is applied that forces the model's predictions to be consistent across two distinct dropout passes. Our method can be applied to general augmentation techniques and consistently improves performance on both text classification and question-answering tasks.
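The abstract describes two mechanisms: learning from the organic teacher's soft labels on augmented data, and a self-regularization term that keeps predictions consistent across two dropout passes. Below is a minimal PyTorch-style sketch of how such a combined objective could look for classification; the function name, loss weights `alpha`/`beta`, and the exact form of each term are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def denoised_augmentation_loss(student, teacher, x_orig, y_orig, x_aug,
                               alpha=0.5, beta=1.0):
    """Hypothetical combined objective with three parts:
    (1) supervised loss on the cleaner original data,
    (2) distillation from the teacher's soft labels on augmented data,
    (3) self-regularization across two dropout passes.
    alpha and beta are assumed weighting hyperparameters."""
    # (1) Standard cross-entropy on original (assumed cleaner) examples.
    logits_orig = student(x_orig)
    ce = F.cross_entropy(logits_orig, y_orig)

    # (2) Soft labels from the organic teacher (trained on original data
    # only); the student learns from these instead of hard augmented labels.
    with torch.no_grad():
        soft_labels = F.softmax(teacher(x_aug), dim=-1)
    logits_aug = student(x_aug)
    distill = F.kl_div(F.log_softmax(logits_aug, dim=-1),
                       soft_labels, reduction="batchmean")

    # (3) Self-regularization: two forward passes in train mode use
    # independent dropout masks; a symmetric KL pushes them to agree.
    log_p_a = F.log_softmax(student(x_aug), dim=-1)
    log_p_b = F.log_softmax(student(x_aug), dim=-1)
    self_reg = 0.5 * (F.kl_div(log_p_a, log_p_b.exp(), reduction="batchmean")
                      + F.kl_div(log_p_b, log_p_a.exp(), reduction="batchmean"))

    return ce + alpha * distill + beta * self_reg
```

For the dropout-consistency term to be meaningful, `student` must be in training mode (`student.train()`) so the two forward passes on `x_aug` sample different dropout masks.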