Paper Title

The Glass Ceiling of Automatic Evaluation in Natural Language Generation

Authors

Pierre Colombo, Maxime Peyrard, Nathan Noiry, Robert West, Pablo Piantanida

Abstract

Automatic evaluation metrics capable of replacing human judgments are critical to allowing fast development of new methods. Thus, numerous research efforts have focused on crafting such metrics. In this work, we take a step back and analyze recent progress by comparing the body of existing automatic metrics and human metrics altogether. As metrics are used based on how they rank systems, we compare metrics in the space of system rankings. Our extensive statistical analysis reveals surprising findings: automatic metrics -- old and new -- are much more similar to each other than to humans. Automatic metrics are not complementary and rank systems similarly. Strikingly, human metrics predict each other much better than the combination of all automatic metrics used to predict a human metric. It is surprising because human metrics are often designed to be independent, to capture different aspects of quality, e.g. content fidelity or readability. We provide a discussion of these findings and recommendations for future work in the field of evaluation.
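To make the abstract's central methodological choice concrete (comparing metrics by the system rankings they induce, rather than by raw scores), here is a minimal sketch. This is not the authors' actual analysis pipeline: the metric names and per-system scores are hypothetical placeholders, and Kendall's tau is used here as one standard rank-correlation measure for this kind of comparison.

```python
# Minimal sketch (not the paper's pipeline): comparing two evaluation
# metrics "in the space of system rankings" via Kendall's tau.
from scipy.stats import kendalltau

# Hypothetical per-system scores from two automatic metrics
# (one score per NLG system under comparison; values are made up).
metric_a_scores = [0.71, 0.65, 0.80, 0.59, 0.74]
metric_b_scores = [0.68, 0.61, 0.77, 0.63, 0.70]

# Kendall's tau depends only on the relative ordering of the scores,
# so it measures how similarly the two metrics rank the systems
# (tau = 1.0 means identical rankings, -1.0 means reversed rankings).
tau, p_value = kendalltau(metric_a_scores, metric_b_scores)
print(f"Kendall tau between system rankings: {tau:.2f} (p={p_value:.3f})")
```

A high tau between two automatic metrics, paired with low tau against human judgments, is the kind of pattern the abstract summarizes as automatic metrics being "much more similar to each other than to humans."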
