Paper Title


Re-evaluating Evaluation in Text Summarization

Authors

Manik Bhandari, Pranav Gour, Atabak Ashfaq, Pengfei Liu, Graham Neubig

Abstract


Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not -- for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
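To make the two evaluation settings mentioned in the abstract concrete, the following is a minimal sketch (not the paper's actual code) of how system-level and summary-level correlations between an automatic metric and human judgments are typically computed. The variable names (metric_scores, human_scores) and the use of Kendall's tau are illustrative assumptions; the paper itself studies several metrics and correlation measures.

```python
from statistics import mean
from scipy.stats import kendalltau

# Hypothetical inputs: metric_scores[i][j] and human_scores[i][j] are the
# automatic-metric score and human score for system i's summary of document j.

def system_level_corr(metric_scores, human_scores):
    # System level: average each system's scores over all documents,
    # then correlate the per-system averages (one point per system).
    sys_metric = [mean(row) for row in metric_scores]
    sys_human = [mean(row) for row in human_scores]
    tau, _ = kendalltau(sys_metric, sys_human)
    return tau

def summary_level_corr(metric_scores, human_scores):
    # Summary level: for each document, correlate scores across systems,
    # then average the per-document correlations.
    n_systems = len(metric_scores)
    n_docs = len(metric_scores[0])
    taus = []
    for j in range(n_docs):
        m = [metric_scores[i][j] for i in range(n_systems)]
        h = [human_scores[i][j] for i in range(n_systems)]
        tau, _ = kendalltau(m, h)
        taus.append(tau)
    return mean(taus)
```

Under this sketch, a metric can rank whole systems well (high system-level correlation) while still being unreliable for comparing individual summaries of the same document (low summary-level correlation), which is why the paper reports both settings.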
