Paper Title
Multi-View Clustering for Open Knowledge Base Canonicalization
Paper Authors
Paper Abstract
Open information extraction (OIE) methods extract plenty of OIE triples <noun phrase, relation phrase, noun phrase> from unstructured text, which compose large open knowledge bases (OKBs). Noun phrases and relation phrases in such OKBs are not canonicalized, which leads to scattered and redundant facts. It is found that two views of knowledge (i.e., a fact view based on the fact triple and a context view based on the fact triple's source context) provide complementary information that is vital to the task of OKB canonicalization, which clusters synonymous noun phrases and relation phrases into the same group and assigns them unique identifiers. However, these two views of knowledge have so far been leveraged in isolation by existing works. In this paper, we propose CMVC, a novel unsupervised framework that leverages these two views of knowledge jointly for canonicalizing OKBs without the need for manually annotated labels. To achieve this goal, we propose a multi-view CH K-Means clustering algorithm to mutually reinforce the clustering of view-specific embeddings learned from each view by considering their different clustering qualities. In order to further enhance the canonicalization performance, we propose a training data optimization strategy, in terms of data quantity and data quality respectively, in each particular view to refine the learned view-specific embeddings in an iterative manner. Additionally, we propose a Log-Jump algorithm to predict the optimal number of clusters in a data-driven way without requiring any labels. We demonstrate the superiority of our framework through extensive experiments on multiple real-world OKB data sets against state-of-the-art methods.
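The abstract names two algorithmic components, the multi-view CH K-Means clustering and the Log-Jump estimator of the number of clusters, without giving their details. The sketch below is a minimal, hypothetical illustration of how such components could look, assuming "CH" refers to the Calinski-Harabasz index and reading "Log-Jump" as a jump-method-style criterion on log-distortions; the function names, parameters, and the joint-clustering scheme are invented for illustration and are not the paper's CMVC implementation.

```python
# Minimal, hypothetical sketch (not the paper's CMVC implementation):
#   1) ch_weighted_multi_view_kmeans: weights each view by its Calinski-Harabasz
#      (CH) score so the view with higher clustering quality contributes more
#      to a joint K-Means run over the concatenated, weight-scaled embeddings.
#   2) predict_num_clusters: a jump-method-style guess of the cluster count
#      based on the largest drop in log-distortion, offered as one plausible
#      reading of a "Log-Jump" criterion.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score
from sklearn.preprocessing import normalize


def ch_weighted_multi_view_kmeans(fact_emb, context_emb, n_clusters, seed=0):
    """Jointly cluster phrases from fact-view and context-view embeddings."""
    # L2-normalize each view so neither dominates purely by scale.
    views = [normalize(np.asarray(fact_emb)), normalize(np.asarray(context_emb))]

    # Score each view's own K-Means solution with the CH index
    # (higher CH = more compact, better-separated clusters).
    ch_scores = []
    for X in views:
        labels = KMeans(n_clusters=n_clusters, n_init=10,
                        random_state=seed).fit_predict(X)
        ch_scores.append(calinski_harabasz_score(X, labels))
    weights = np.array(ch_scores) / np.sum(ch_scores)

    # Joint clustering: scaling a view by sqrt(w) makes its squared-distance
    # contribution proportional to w in the joint K-Means objective.
    joint = np.hstack([np.sqrt(w) * X for w, X in zip(weights, views)])
    return KMeans(n_clusters=n_clusters, n_init=10,
                  random_state=seed).fit_predict(joint)


def predict_num_clusters(X, k_min=2, k_max=50, seed=0):
    """Pick K at the largest drop ("jump") in log-distortion across K values."""
    X = np.asarray(X)
    ks = list(range(k_min, min(k_max, len(X) - 1) + 1))
    log_distortion = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        # inertia_ = total within-cluster sum of squared distances.
        log_distortion.append(np.log(km.inertia_ / len(X)))
    jumps = -np.diff(log_distortion)  # drop when going from ks[i] to ks[i+1]
    return ks[int(np.argmax(jumps)) + 1]
```

In a pipeline of this style, `predict_num_clusters` might be run on, say, the fact-view embeddings to choose the number of clusters before the joint clustering step; how CMVC actually couples these steps is described in the paper itself, not here.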