论文标题
通过极值理论的概率分布聚类
Clustering by the Probability Distributions from Extreme Value Theory
论文作者
论文摘要
聚类是无监督学习的重要任务。它试图自动将实例分为一致的子集。作为最著名的聚类算法之一,K均值将边界的样品点分配给独特的群集,而它不利用样本分布或密度的信息。相比之下,考虑可能群集中每个样品的概率可能会更有益。为此,本文概括了K-均值来对簇的分布进行建模。因此,我们的新型聚类算法在极值理论(EVT)中通过广义帕累托分布(GPD)在阈值上对质心的距离分布进行了建模。值得注意的是,我们提出了质心距离距离的概念,使用GPD为每个群集建立一个概率模型,并基于从GPD得出的覆盖概率函数执行聚类算法。因此,从概率的角度来看,这样的GPD K-均值可以使聚类算法从事聚类算法。相应地,我们还引入了一个幼稚的基线,称为广义极值(GEV)K-均值。 GEV适合块最大值的分布。相比之下,GPD拟合距离的分布超过足够大的阈值,从而导致GPD K-均值的性能更稳定。值得注意的是,GEV K-均值还可以估计群集结构,因此在经典的K-均值中表现出色。因此,对合成数据集和真实数据集进行的广泛实验表明,GPD K-均值优于竞争对手。 GitHub代码在https://github.com/sixiaozheng/evt-k-means中发布。
Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in https://github.com/sixiaozheng/EVT-K-means.