Paper Title

Evaluation Toolkit For Robustness Testing Of Automatic Essay Scoring Systems

Authors

Kabra, Anubha; Bhatia, Mehar; Kumar, Yaman; Li, Junyi Jessy; Shah, Rajiv Ratn

Abstract

Automatic scoring engines have been used to score approximately fifteen million test-takers in just the last three years. This number is increasing further due to COVID-19 and the associated automation of education and testing. Despite such wide usage, the literature on testing these "intelligent" AI-based models is highly lacking. Most papers proposing new models rely only on quadratic weighted kappa (QWK) agreement with human raters to demonstrate model efficacy. However, this effectively ignores the highly multi-feature nature of essay scoring, which depends on features such as coherence, grammar, relevance, sufficiency, and vocabulary. To date, there has been no study testing Automated Essay Scoring (AES) systems holistically on all these features. With this motivation, we propose a model-agnostic adversarial evaluation scheme and associated metrics for AES systems to test their natural language understanding capabilities and overall robustness. We evaluate current state-of-the-art AES models using the proposed scheme and report results on five recent models, ranging from feature-engineering-based approaches to the latest deep learning algorithms. We find that AES models are highly overstable: even heavy modifications (as much as 25%) with content unrelated to the topic of the question do not decrease the scores produced by the models. On the other hand, irrelevant content, on average, increases the scores, showing that model evaluation strategies and rubrics should be reconsidered. We also ask 200 human raters to score both an original and an adversarial response to see whether humans can detect the difference between the two and whether they agree with the scores assigned by the automatic scorers.
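
Since the abstract notes that most AES papers report only quadratic weighted kappa (QWK) agreement with human raters, the following minimal sketch shows how that agreement metric is typically computed. The scikit-learn call is standard, but the example score arrays are made-up illustrations, not data from the paper.

```python
# Minimal sketch: quadratic weighted kappa (QWK) agreement between
# human-assigned and model-assigned essay scores, the metric most AES
# papers report. The score arrays below are purely illustrative.
from sklearn.metrics import cohen_kappa_score

human_scores = [8, 6, 9, 7, 5, 10, 6, 8]   # hypothetical human-rater scores
model_scores = [8, 7, 9, 6, 5, 9, 6, 8]    # hypothetical AES model scores

# weights="quadratic" turns Cohen's kappa into the quadratic weighted kappa.
qwk = cohen_kappa_score(human_scores, model_scores, weights="quadratic")
print(f"QWK agreement: {qwk:.3f}")
```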

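The model-agnostic adversarial evaluation described above can be pictured as follows: inject topic-irrelevant content into a response (up to about 25% of its length) and compare a black-box model's score before and after. The sketch below is only an illustration of that idea; `score_essay`, the filler sentences, and the perturbation ratio are hypothetical assumptions, not the paper's actual toolkit.

```python
import random

# Hypothetical off-topic sentences, unrelated to any essay prompt.
OFF_TOPIC = [
    "The Great Wall of China stretches for thousands of kilometres.",
    "Penguins are flightless birds found mostly in the Southern Hemisphere.",
    "The local stock market closed slightly higher on Tuesday.",
]

def perturb_response(response: str, ratio: float = 0.25) -> str:
    """Append topic-irrelevant sentences until roughly `ratio` of the
    resulting word count is unrelated content (the kind of heavy
    modification the abstract refers to)."""
    original_words = len(response.split())
    target_extra = int(original_words * ratio / (1.0 - ratio))
    added, extra_words = [], 0
    while extra_words < target_extra:
        sentence = random.choice(OFF_TOPIC)
        added.append(sentence)
        extra_words += len(sentence.split())
    return response + " " + " ".join(added)

def overstability_gap(score_essay, response: str) -> float:
    """`score_essay` is any black-box AES scorer mapping text to a score.
    A gap near zero (or negative) despite the irrelevant additions is the
    overstability the abstract reports."""
    return score_essay(response) - score_essay(perturb_response(response))
```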