论文标题
倒福k-均值聚类:绩效分析
Inverted-File k-Means Clustering: Performance Analysis
论文作者
论文摘要
本文介绍了一种倒置的k-均值聚类算法(IVF),适用于具有许多类别的大规模稀疏数据集。鉴于这样的数据集,IVF在高速和低内存消耗中有效工作,这与标准劳埃德的算法保持相同的解决方案。高性能来自两个不同的数据表示。一个是对象和平均特征向量的稀疏表达式。另一个是一组平均特征向量的倒文件数据结构。为了确认这些表示的效果,我们使用不同的数据结构和表达式设计三种算法进行比较。我们在实验上证明,IVF的性能要比设计的算法更好地应用于配备超级量表外处理器和深层层次内存系统的现代计算机系统中的大规模真实文档数据集。我们还为速度绩效分析引入了一个简单而实用的时钟周期(CPI)模型。分析结果表明,IVF抑制了三个绩效降低因素:缓存误差,分支错误预测和完成的指示的数量。
This paper presents an inverted-file k-means clustering algorithm (IVF) suitable for a large-scale sparse data set with potentially numerous classes. Given such a data set, IVF efficiently works at high-speed and with low memory consumption, which keeps the same solution as a standard Lloyd's algorithm. The high performance arises from two distinct data representations. One is a sparse expression for both the object and mean feature vectors. The other is an inverted-file data structure for a set of the mean feature vectors. To confirm the effect of these representations, we design three algorithms using distinct data structures and expressions for comparison. We experimentally demonstrate that IVF achieves better performance than the designed algorithms when they are applied to large-scale real document data sets in a modern computer system equipped with superscalar out-of-order processors and a deep hierarchical memory system. We also introduce a simple yet practical clock-cycle per instruction (CPI) model for speed-performance analysis. Analytical results reveal that IVF suppresses three performance degradation factors: the numbers of cache misses, branch mispredictions, and the completed instructions.