A Scalable Framework for Quality Assessment of RDF Datasets

Paper title

A Scalable Framework for Quality Assessment of RDF Datasets

Authors

Gezim Sejdiu, Anisa Rula, Jens Lehmann, Hajira Jabeen

Abstract

Over the last years, Linked Data has grown continuously. Today, we count more than 10,000 datasets being available online following Linked Data standards. These standards allow data to be machine readable and inter-operable. Nevertheless, many applications, such as data integration, search, and interlinking, cannot take full advantage of Linked Data if it is of low quality. There exist a few approaches for the quality assessment of Linked Data, but their performance degrades with the increase in data size and quickly grows beyond the capabilities of a single machine. In this paper, we present DistQualityAssessment -- an open source implementation of quality assessment of large RDF datasets that can scale out to a cluster of machines. This is the first distributed, in-memory approach for computing different quality metrics for large RDF datasets using Apache Spark. We also provide a quality assessment pattern that can be used to generate new scalable metrics that can be applied to big data. The work presented here is integrated with the SANSA framework and has been applied to at least three use cases beyond the SANSA community. The results show that our approach is more generic, efficient, and scalable as compared to previously proposed approaches.
