Paper Title


Re-evaluating Evaluation in Text Summarization

Authors

Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig

Abstract


Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
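To make the two evaluation settings mentioned in the abstract concrete, the following is a minimal sketch (not the paper's actual code) of how system-level and summary-level correlations between an automatic metric and human judgments are typically computed. The variable names (metric_scores, human_scores) and the use of Kendall's tau are illustrative assumptions; the paper itself studies several metrics and correlation measures.

```python
from statistics import mean
from scipy.stats import kendalltau

# Hypothetical inputs: metric_scores[i][j] and human_scores[i][j] are the
# automatic-metric score and human score for system i's summary of document j.

def system_level_corr(metric_scores, human_scores):
    # System level: average each system's scores over all documents,
    # then correlate the per-system averages (one point per system).
    sys_metric = [mean(row) for row in metric_scores]
    sys_human = [mean(row) for row in human_scores]
    tau, _ = kendalltau(sys_metric, sys_human)
    return tau

def summary_level_corr(metric_scores, human_scores):
    # Summary level: for each document, correlate scores across systems,
    # then average the per-document correlations.
    n_systems = len(metric_scores)
    n_docs = len(metric_scores[0])
    taus = []
    for j in range(n_docs):
        m = [metric_scores[i][j] for i in range(n_systems)]
        h = [human_scores[i][j] for i in range(n_systems)]
        tau, _ = kendalltau(m, h)
        taus.append(tau)
    return mean(taus)
```

Under this sketch, a metric can rank whole systems well (high system-level correlation) while still being unreliable for comparing individual summaries of the same document (low summary-level correlation), which is why the paper reports both settings.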
