Paper Title
Re-thinking Knowledge Graph Completion Evaluation from an Information Retrieval Perspective
Paper Authors
Paper Abstract
Knowledge graph completion (KGC) aims to infer missing knowledge triples based on known facts in a knowledge graph. Current KGC research mostly follows an entity ranking protocol, wherein effectiveness is measured by the predicted rank of a masked entity in a test triple. The overall performance is then given by a micro(-average) metric over all individual answer entities. Due to the incomplete nature of large-scale knowledge bases, such an entity ranking setting is likely affected by unlabelled top-ranked positive examples, raising questions about whether the current evaluation protocol is sufficient to guarantee a fair comparison of KGC systems. To this end, this paper presents a systematic study on whether and how label sparsity affects the current KGC evaluation with the popular micro metrics. Specifically, inspired by the TREC paradigm for large-scale information retrieval (IR) experimentation, we create a relatively "complete" judgment set based on a sample from the popular FB15k-237 dataset following the TREC pooling method. According to our analysis, switching from the original labels to our "complete" labels leads, surprisingly, to a drastic change in the system ranking of 13 popular KGC models in terms of micro metrics. Further investigation indicates that the IR-like macro(-average) metrics are more stable and discriminative under different settings, while being less affected by label sparsity. Thus, for KGC evaluation, we recommend conducting TREC-style pooling to balance human effort against label completeness, and also reporting IR-like macro metrics to reflect the ranking nature of the KGC task.
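To make the micro/macro distinction discussed in the abstract concrete, below is a minimal sketch (not the authors' code) of the two averaging schemes for mean reciprocal rank (MRR) in entity ranking. The `results` dictionary and its entries are hypothetical: it maps each query, i.e. a (head entity, relation) pair, to the filtered ranks of its gold answer entities.

```python
from statistics import mean

# Hypothetical example: ranks of the gold tail entities for each (head, relation) query.
results = {
    ("paris", "capital_of"): [1],              # single-answer query
    ("usa", "contains_city"): [2, 5, 40, 80],  # multi-answer query
}

def micro_mrr(results):
    """Average reciprocal rank over every individual (query, answer) pair."""
    rr = [1.0 / r for ranks in results.values() for r in ranks]
    return mean(rr)

def macro_mrr(results):
    """IR-style: average within each query first, then average across queries."""
    per_query = [mean(1.0 / r for r in ranks) for ranks in results.values()]
    return mean(per_query)

print(f"micro MRR: {micro_mrr(results):.3f}")  # dominated by queries with many answers
print(f"macro MRR: {macro_mrr(results):.3f}")  # each query contributes equally
```

In the micro average, queries with many answer entities contribute proportionally more (query, answer) pairs, whereas the macro average weights every query equally, which is the IR-like behaviour the paper argues is more robust to label sparsity.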