Paper Title

MOCHA: A Dataset for Training and Evaluating Generative Reading Comprehension Metrics

Paper Authors

Anthony Chen, Gabriel Stanovsky, Sameer Singh, Matt Gardner

Paper Abstract

Posing reading comprehension as a generation problem provides a great deal of flexibility, allowing for open-ended questions with few restrictions on possible answers. However, progress is impeded by existing generation metrics, which rely on token overlap and are agnostic to the nuances of reading comprehension. To address this, we introduce a benchmark for training and evaluating generative reading comprehension metrics: MOdeling Correctness with Human Annotations. MOCHA contains 40K human judgement scores on model outputs from 6 diverse question answering datasets and an additional set of minimal pairs for evaluation. Using MOCHA, we train a Learned Evaluation metric for Reading Comprehension, LERC, to mimic human judgement scores. LERC outperforms baseline metrics by 10 to 36 absolute Pearson points on held-out annotations. When we evaluate robustness on minimal pairs, LERC achieves 80% accuracy, outperforming baselines by 14 to 26 absolute percentage points while leaving significant room for improvement. MOCHA presents a challenging problem for developing accurate and robust generative reading comprehension metrics.
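The abstract reports two evaluations: Pearson correlation between a metric's scores and MOCHA's held-out human judgement scores, and accuracy on minimal pairs, where a robust metric should score the correct candidate of each pair higher than the corrupted one. Below is a minimal sketch of both evaluations, not the authors' released code: the `metric_score` function is a hypothetical token-overlap stand-in for a learned metric such as LERC (a real metric would also condition on the passage and question), and all data shown is made up for illustration.

```python
# Minimal sketch of the two evaluations described in the abstract.
# `metric_score`, `annotations`, and `minimal_pairs` are hypothetical
# placeholders, not MOCHA data or the LERC model.

from scipy.stats import pearsonr


def metric_score(reference: str, candidate: str) -> float:
    """Hypothetical stand-in for a learned metric: token-overlap F1."""
    ref_tokens = reference.lower().split()
    cand_tokens = candidate.lower().split()
    common = sum(min(ref_tokens.count(t), cand_tokens.count(t))
                 for t in set(cand_tokens))
    if not common:
        return 0.0
    precision = common / len(cand_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical held-out annotations: (reference answer, model output,
# human judgement score).
annotations = [
    ("the Eiffel Tower", "the Eiffel Tower in Paris", 5.0),
    ("in 1889", "in 1887", 1.0),
    ("Gustave Eiffel", "Eiffel", 4.0),
]

# Evaluation 1: Pearson correlation against human judgement scores.
metric_scores = [metric_score(ref, cand) for ref, cand, _ in annotations]
human_scores = [score for _, _, score in annotations]
r, _ = pearsonr(metric_scores, human_scores)
print(f"Pearson r against human judgements: {r:.3f}")

# Evaluation 2: minimal-pair robustness. Each pair shares a reference and
# differs minimally between a correct and a corrupted candidate; the metric
# is counted correct when it prefers the correct candidate.
minimal_pairs = [
    ("in 1889", "in 1889", "in 1898"),
    ("Gustave Eiffel", "Gustave Eiffel", "Gustave Courbet"),
]
correct = sum(
    metric_score(ref, good) > metric_score(ref, bad)
    for ref, good, bad in minimal_pairs
)
print(f"Minimal-pair accuracy: {correct / len(minimal_pairs):.0%}")
```

A token-overlap baseline like this one is exactly what the abstract argues is insufficient (it is agnostic to the nuances of reading comprehension); substituting a learned metric trained on MOCHA's 40K judgements is what yields the reported gains of 10 to 36 Pearson points and 14 to 26 accuracy points.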
