Paper title
Approximate sorting and its application in the I/O model
Paper authors
Abstract
This paper considers approximate sorting for big data. The goal is to produce an approximately sorted result at lower CPU and I/O cost. For big data, we study approximate sorting in the I/O model. Existing metrics on the permutation space are not suitable for external approximate sorting algorithms. We therefore propose a new kind of metric, the external metric, which ignores the errors and dislocations that occur within each I/O block. The external Spearman's footrule metric is one example: it is the external metric derived from Spearman's footrule. Furthermore, to facilitate better evaluation of approximately sorted results, we propose a new metric named errors, which directly states the number of dislocated elements. Its external counterpart, external errors, is also considered. Then, using the rate-distortion relationship endowed by these two metrics, we prove lower bounds on both metrics for the external approximate sorting problem with t I/O operations. We propose a k-pass external approximate sorting algorithm named EASORT and prove that EASORT is asymptotically optimal. Finally, we consider applications of approximate sorting results. We propose an index over the result of our approximate sorting and use it to analyze single and range queries on the approximately sorted result. We also discuss the sort-merge join of two relations in the cases where one relation or both relations are approximately sorted.
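To make the four metrics named in the abstract concrete, the following is a minimal Python sketch. The abstract does not give formal definitions, so everything here is an assumption for illustration: a permutation is represented as a list where `perm[i]` is the sorted rank of the element stored at position `i`, Spearman's footrule is taken in its standard form (sum of absolute displacements), errors is read as the count of elements not in their sorted position, and the "external" variants are modeled by comparing block indices (`position // block_size`) instead of positions, so that misplacement inside one I/O block costs nothing. The function names and the `block_size` parameter are hypothetical and may differ from the paper's definitions.

```python
def footrule(perm):
    """Standard Spearman's footrule: total absolute displacement.

    perm[i] is the sorted rank of the element stored at position i.
    """
    return sum(abs(i - r) for i, r in enumerate(perm))


def errors(perm):
    """Number of elements that are not at their sorted position (assumed reading)."""
    return sum(1 for i, r in enumerate(perm) if i != r)


def external_footrule(perm, block_size):
    """Assumed block-level footrule: displacement measured in I/O blocks,
    so any dislocation inside a single block contributes zero."""
    return sum(abs(i // block_size - r // block_size) for i, r in enumerate(perm))


def external_errors(perm, block_size):
    """Assumed block-level errors: elements stored in the wrong I/O block."""
    return sum(1 for i, r in enumerate(perm) if i // block_size != r // block_size)


# Every swap stays inside a block of size 2, so only the internal metrics are nonzero.
within_blocks = [1, 0, 2, 3, 5, 4, 7, 6]

# Ranks 1 and 2 are swapped across the boundary between block 0 and block 1.
across_blocks = [0, 2, 1, 3]
```

Under these assumptions, `within_blocks` has a footrule of 6 and 6 errors, while both external variants are 0 with `block_size=2`, which is the behavior the abstract attributes to an external metric. For `across_blocks`, all four functions return 2 because the two misplaced elements sit in the wrong block.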