论文标题
信息太多杀死信息:聚类的观点
Too Much Information Kills Information: A Clustering Perspective
论文作者
论文摘要
聚类是人工智能领域中最基本的工具之一,尤其是在模式识别和学习理论中。在本文中,我们提出了一种简单但新颖的方法,用于基于方差的K群集任务,其中包括广为人知的K-均值聚类。所提出的方法从给定数据集中选择一个采样子集,并仅根据子集中的数据信息做出决策。通过某些假设,所产生的聚类证明是有益于估计具有很高概率的基于方差的目标的最佳。关于合成数据集和现实世界数据集的广泛实验表明,要获得与K-Means方法(Llyod 1982)和K-Means ++方法(Arthur and Vassilvitskii 2007)相比,获得竞争成果,我们只需要7%的数据集信息信息。如果我们有多达15%的数据集信息,那么在集群质量方面,我们的算法在至少80%的聚类任务中均优于K-Means方法和K-Means ++方法。同样,基于相同想法的扩展算法可以确保平衡的k群集结果。
Clustering is one of the most fundamental tools in the artificial intelligence area, particularly in the pattern recognition and learning theory. In this paper, we propose a simple, but novel approach for variance-based k-clustering tasks, included in which is the widely known k-means clustering. The proposed approach picks a sampling subset from the given dataset and makes decisions based on the data information in the subset only. With certain assumptions, the resulting clustering is provably good to estimate the optimum of the variance-based objective with high probability. Extensive experiments on synthetic datasets and real-world datasets show that to obtain competitive results compared with k-means method (Llyod 1982) and k-means++ method (Arthur and Vassilvitskii 2007), we only need 7% information of the dataset. If we have up to 15% information of the dataset, then our algorithm outperforms both the k-means method and k-means++ method in at least 80% of the clustering tasks, in terms of the quality of clustering. Also, an extended algorithm based on the same idea guarantees a balanced k-clustering result.