通过极值理论的概率分布聚类

论文标题

通过极值理论的概率分布聚类

Clustering by the Probability Distributions from Extreme Value Theory

论文作者

Zheng, Sixiao, Fan, Ke, Hou, Yanxi, Feng, Jianfeng, Fu, Yanwei

论文摘要

聚类是无监督学习的重要任务。它试图自动将实例分为一致的子集。作为最著名的聚类算法之一，K均值将边界的样品点分配给独特的群集，而它不利用样本分布或密度的信息。相比之下，考虑可能群集中每个样品的概率可能会更有益。为此，本文概括了K-均值来对簇的分布进行建模。因此，我们的新型聚类算法在极值理论（EVT）中通过广义帕累托分布（GPD）在阈值上对质心的距离分布进行了建模。值得注意的是，我们提出了质心距离距离的概念，使用GPD为每个群集建立一个概率模型，并基于从GPD得出的覆盖概率函数执行聚类算法。因此，从概率的角度来看，这样的GPD K-均值可以使聚类算法从事聚类算法。相应地，我们还引入了一个幼稚的基线，称为广义极值（GEV）K-均值。 GEV适合块最大值的分布。相比之下，GPD拟合距离的分布超过足够大的阈值，从而导致GPD K-均值的性能更稳定。值得注意的是，GEV K-均值还可以估计群集结构，因此在经典的K-均值中表现出色。因此，对合成数据集和真实数据集进行的广泛实验表明，GPD K-均值优于竞争对手。 GitHub代码在https://github.com/sixiaozheng/evt-k-means中发布。

Clustering is an essential task to unsupervised learning. It tries to automatically separate instances into coherent subsets. As one of the most well-known clustering algorithms, k-means assigns sample points at the boundary to a unique cluster, while it does not utilize the information of sample distribution or density. Comparably, it would potentially be more beneficial to consider the probability of each sample in a possible cluster. To this end, this paper generalizes k-means to model the distribution of clusters. Our novel clustering algorithm thus models the distributions of distances to centroids over a threshold by Generalized Pareto Distribution (GPD) in Extreme Value Theory (EVT). Notably, we propose the concept of centroid margin distance, use GPD to establish a probability model for each cluster, and perform a clustering algorithm based on the covering probability function derived from GPD. Such a GPD k-means thus enables the clustering algorithm from the probabilistic perspective. Correspondingly, we also introduce a naive baseline, dubbed as Generalized Extreme Value (GEV) k-means. GEV fits the distribution of the block maxima. In contrast, the GPD fits the distribution of distance to the centroid exceeding a sufficiently large threshold, leading to a more stable performance of GPD k-means. Notably, GEV k-means can also estimate cluster structure and thus perform reasonably well over classical k-means. Thus, extensive experiments on synthetic datasets and real datasets demonstrate that GPD k-means outperforms competitors. The github codes are released in https://github.com/sixiaozheng/EVT-K-means.

下载PDF全文

下载文献需遵守相关版权规定

论文标题