Paper Title
DQI: Measuring Data Quality in NLP
Paper Authors
Paper Abstract
Neural language models have achieved human-level performance on several NLP datasets. However, recent studies have shown that these models are not truly learning the desired task; rather, their high performance is attributed to overfitting to spurious biases, which suggests that the capabilities of AI systems have been over-estimated. We introduce a generic formula for Data Quality Index (DQI) to help dataset creators create datasets free of such unwanted biases. We evaluate this formula using AFLite, a recently proposed approach for adversarial filtering. We propose a new data creation paradigm using DQI to create higher-quality data. The data creation paradigm consists of several data visualizations that help data creators (i) understand the quality of the data and (ii) visualize the impact of a created data instance on the overall quality. It also has a couple of automation methods to (i) assist data creators and (ii) make the model more robust to adversarial attacks. We use DQI along with these automation methods to renovate biased examples in SNLI. We show that models trained on the renovated SNLI dataset generalize better to out-of-distribution tasks. Renovation results in reduced model performance, exposing a large gap with respect to human performance. DQI systematically helps in creating harder benchmarks using active learning. Our work takes the process of dynamic dataset creation forward, wherein datasets evolve together with the evolving state of the art, and can therefore serve as a means of benchmarking the true progress of AI.
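For context, AFLite, the adversarial-filtering approach the abstract evaluates DQI against, works by repeatedly training weak linear probes on random partitions of the data and discarding instances those probes predict too easily. Below is a minimal Python sketch of that loop; the function name aflite and all hyperparameters (n_partitions, train_frac, tau, k, min_size) are illustrative assumptions, not the paper's exact settings.

import numpy as np
from sklearn.linear_model import LogisticRegression

def aflite(X, y, n_partitions=64, train_frac=0.8, tau=0.75, k=500, min_size=1000):
    """Keep only instances that weak linear probes cannot reliably predict."""
    idx = np.arange(len(X))  # indices of surviving instances
    while len(idx) > min_size:
        correct = np.zeros(len(idx))
        counts = np.zeros(len(idx))
        for _ in range(n_partitions):
            # Random train/held-out partition over the surviving instances.
            # Assumes y contains at least two classes in each random split.
            perm = np.random.permutation(len(idx))
            cut = int(train_frac * len(idx))
            tr, te = perm[:cut], perm[cut:]
            probe = LogisticRegression(max_iter=200)
            probe.fit(X[idx[tr]], y[idx[tr]])
            correct[te] += probe.predict(X[idx[te]]) == y[idx[te]]
            counts[te] += 1
        # Predictability score: how often each held-out instance was guessed right.
        score = np.divide(correct, counts, out=np.zeros_like(correct), where=counts > 0)
        easy = np.argsort(-score)[:k]   # the k most predictable instances...
        easy = easy[score[easy] > tau]  # ...that also exceed the threshold
        if len(easy) == 0:
            break  # nothing is "too easy" anymore; stop filtering
        idx = np.delete(idx, easy)
    return idx  # indices of the retained, harder subset

The feature matrix X is assumed to hold pre-computed embeddings (e.g., from a frozen language model), as in the AFLite setup; the returned indices define the filtered, harder subset of the benchmark.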