Paper Title

FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

Paper Authors

Esin Durmus, He He, Mona Diab

Paper Abstract

Neural abstractive summarization models are prone to generate content inconsistent with the source document, i.e. unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating faithfulness of a generated summary given its source document. We first collected human annotations of faithfulness for outputs from numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose an automatic question answering (QA) based metric for faithfulness, FEQA, which leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matched answers indicate unfaithful information in the summary. Among metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric has significantly higher correlation with human faithfulness scores, especially on highly abstractive summaries.
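The abstract describes the core FEQA procedure: question-answer pairs are generated from the summary, a reading-comprehension model answers the same questions against the source document, and mismatched answers signal unfaithful content. Below is a minimal sketch of that idea, not the authors' released implementation: it assumes a separate question-generation step has already produced (question, answer) pairs from the summary, and it stands in an off-the-shelf Hugging Face QA pipeline (the `distilbert-base-cased-distilled-squad` model is just an example choice) with SQuAD-style token F1 as the answer-match score.

```python
# Minimal FEQA-style faithfulness sketch (illustrative only, not the paper's code).
from collections import Counter
from transformers import pipeline

# Off-the-shelf extractive QA model standing in for the reading-comprehension component.
qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_tokens, gold_tokens = pred.lower().split(), gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def feqa_style_score(document: str, qa_pairs: list[tuple[str, str]]) -> float:
    """Average overlap between summary-side answers and answers extracted from the document.

    Low scores suggest the summary contains information the document does not support.
    """
    scores = []
    for question, summary_answer in qa_pairs:
        doc_answer = qa(question=question, context=document)["answer"]
        scores.append(token_f1(doc_answer, summary_answer))
    return sum(scores) / len(scores) if scores else 0.0

# Example: a faithful summary-derived QA pair should score near 1.0.
doc = "The committee approved the budget on Tuesday after a lengthy debate."
summary_qas = [("When was the budget approved?", "Tuesday")]
print(feqa_style_score(doc, summary_qas))
```

In the paper, the summary-side pairs come from a learned question-generation model and the final metric averages answer overlap over all generated pairs; the sketch above keeps only the answer-comparison half of that pipeline.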
