论文标题

低级混合物模型的Lloyd算法的最佳聚类

Optimal Clustering by Lloyd Algorithm for Low-Rank Mixture Model

论文作者

Lyu, Zhongyuan, Xia, Dong

论文摘要

本文研究了聚类基质值观测值的计算和统计限制。我们提出了一个低级别的混合模型(LRMM),该模型适用于经典的高斯混合模型(GMM)来处理基质值观测值,该观测值假设人口中心矩阵的低级别。通过集成Lloyd的算法和低级别近似,设计了一种计算有效的聚类方法。一旦定位良好,该算法将快速收敛并实现最小值最小值的指数型聚类错误率。同时,我们表明一种基于张量的光谱方法提供了良好的初始聚类。最小值最佳聚类错误率与GMM相当,由分离强度(即种群中心矩阵之间的最小距离)决定。通过利用低级度,提出的算法对分离强度的要求较弱。但是,与GMM不同,LRMM的计算难度的特征在于信号强度,即,人口中心矩阵最小的非零奇异值。提供的证据表明,即使信号强度不够强,即使分离强度很强,也没有多项式时间算法是一致的。讨论了LRMM下估计和聚类之间的有趣差异。综合模拟实验证实了低级劳埃德算法的优点。最后,我们的方法在现实世界数据集的文献中优于其他方法。

This paper investigates the computational and statistical limits in clustering matrix-valued observations. We propose a low-rank mixture model (LrMM), adapted from the classical Gaussian mixture model (GMM) to treat matrix-valued observations, which assumes low-rankness for population center matrices. A computationally efficient clustering method is designed by integrating Lloyd's algorithm and low-rank approximation. Once well-initialized, the algorithm converges fast and achieves an exponential-type clustering error rate that is minimax optimal. Meanwhile, we show that a tensor-based spectral method delivers a good initial clustering. Comparable to GMM, the minimax optimal clustering error rate is decided by the separation strength, i.e., the minimal distance between population center matrices. By exploiting low-rankness, the proposed algorithm is blessed with a weaker requirement on the separation strength. Unlike GMM, however, the computational difficulty of LrMM is characterized by the signal strength, i.e., the smallest non-zero singular values of population center matrices. Evidence is provided showing that no polynomial-time algorithm is consistent if the signal strength is not strong enough, even though the separation strength is strong. Intriguing differences between estimation and clustering under LrMM are discussed. The merits of low-rank Lloyd's algorithm are confirmed by comprehensive simulation experiments. Finally, our method outperforms others in the literature on real-world datasets.

扫码加入交流群

加入微信交流群

微信交流群二维码

扫码加入学术交流群,获取更多资源