Paper Title
Re-thinking Knowledge Graph Completion Evaluation from an Information Retrieval Perspective
Paper Authors
Paper Abstract
Knowledge graph completion (KGC) aims to infer missing knowledge triples based on known facts in a knowledge graph. Current KGC research mostly follows an entity ranking protocol, wherein effectiveness is measured by the predicted rank of a masked entity in a test triple. The overall performance is then given by a micro(-average) metric over all individual answer entities. Due to the incomplete nature of large-scale knowledge bases, such an entity ranking setting is likely affected by unlabelled top-ranked positive examples, raising questions about whether the current evaluation protocol is sufficient to guarantee a fair comparison of KGC systems. To this end, this paper presents a systematic study on whether and how label sparsity affects the current KGC evaluation with the popular micro metrics. Specifically, inspired by the TREC paradigm for large-scale information retrieval (IR) experimentation, we create a relatively "complete" judgment set based on a sample from the popular FB15k-237 dataset following the TREC pooling method. According to our analysis, switching from the original labels to our "complete" labels leads, surprisingly, to a drastic change in the system ranking of 13 popular KGC models in terms of micro metrics. Further investigation indicates that the IR-like macro(-average) metrics are more stable and discriminative under different settings, while being less affected by label sparsity. Thus, for KGC evaluation, we recommend conducting TREC-style pooling to balance human effort against label completeness, and also reporting IR-like macro metrics to reflect the ranking nature of the KGC task.
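To make the micro/macro distinction discussed in the abstract concrete, below is a minimal sketch (not the authors' code) of the two averaging schemes for mean reciprocal rank (MRR) in entity ranking. The `results` dictionary and its entries are hypothetical: it maps each query, i.e. a (head entity, relation) pair, to the filtered ranks of its gold answer entities.

```python
from statistics import mean

# Hypothetical example: ranks of the gold tail entities for each (head, relation) query.
results = {
    ("paris", "capital_of"): [1],              # single-answer query
    ("usa", "contains_city"): [2, 5, 40, 80],  # multi-answer query
}

def micro_mrr(results):
    """Average reciprocal rank over every individual (query, answer) pair."""
    rr = [1.0 / r for ranks in results.values() for r in ranks]
    return mean(rr)

def macro_mrr(results):
    """IR-style: average within each query first, then average across queries."""
    per_query = [mean(1.0 / r for r in ranks) for ranks in results.values()]
    return mean(per_query)

print(f"micro MRR: {micro_mrr(results):.3f}")  # dominated by queries with many answers
print(f"macro MRR: {macro_mrr(results):.3f}")  # each query contributes equally
```

In the micro average, queries with many answer entities contribute proportionally more (query, answer) pairs, whereas the macro average weights every query equally, which is the IR-like behaviour the paper argues is more robust to label sparsity.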