Paper Title

Human Evaluation and Correlation with Automatic Metrics in Consultation Note Generation

Authors

Francesco Moramarco, Alex Papadopoulos Korfiatis, Mark Perera, Damir Juric, Jack Flann, Ehud Reiter, Anya Belz, Aleksandar Savkov

Abstract

In recent years, machine learning models have rapidly become better at generating clinical consultation notes; yet, there is little work on how to properly evaluate the generated consultation notes to understand the impact they may have on both the clinician using them and the patient's clinical safety. To address this, we present an extensive human evaluation study of consultation notes where 5 clinicians (i) listen to 57 mock consultations, (ii) write their own notes, (iii) post-edit a number of automatically generated notes, and (iv) extract all the errors, both quantitative and qualitative. We then carry out a correlation study between 18 automatic quality metrics and the human judgements. We find that a simple, character-based Levenshtein distance metric performs on par with, if not better than, common model-based metrics like BERTScore. All our findings and annotations are open-sourced.
