Paper title
Robust Clustering with Normal Mixture Models: A Pseudo $β$-Likelihood Approach
Paper authors
Paper abstract
As in other estimation scenarios, likelihood-based estimation in the normal mixture set-up is highly non-robust against model misspecification and the presence of outliers (apart from being an ill-posed optimization problem). A robust alternative to the ordinary likelihood approach for this estimation problem is proposed, which performs simultaneous estimation and data clustering and leads to subsequent anomaly detection. To invoke robustness, the methodology based on the minimization of the density power divergence (or, alternatively, the maximization of the $β$-likelihood) is utilized under suitable constraints. An iteratively reweighted least squares approach is followed to compute the proposed estimators for the component means (or, equivalently, cluster centers) and component dispersion matrices, which leads to simultaneous data clustering. Some exploratory techniques are also suggested for anomaly detection, a problem of great importance in statistics and machine learning. The proposed method is validated with simulation studies under different set-ups; it performs competitively with, or better than, popular existing methods such as K-medoids, TCLUST, trimmed K-means and MCLUST, especially when the mixture components (i.e., the clusters) share regions with significant overlap or when outlying clusters exist with small but non-negligible weights (particularly in higher dimensions). Two real datasets are also used to illustrate the performance of the newly proposed method in comparison with others, along with an application in image processing. The proposed method detects the clusters with lower misclassification rates and successfully points out the outlying (anomalous) observations in these datasets.
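To make the reweighting idea concrete, the following is a minimal sketch of a density-power-divergence fit for a single multivariate normal component. It is not the authors' mixture algorithm: the pseudo $β$-likelihood, the constraints, and the cluster assignment step are not specified in the abstract and are omitted here. The function name `dpd_normal_fit`, the default `beta=0.3`, and the stopping rule are illustrative assumptions. Each observation receives a weight proportional to its fitted density raised to the power $β$, so points far from the bulk of the data contribute almost nothing to the updated mean and dispersion matrix.

```python
import numpy as np


def dpd_normal_fit(X, beta=0.3, n_iter=100, tol=1e-8):
    """Hypothetical helper: fixed-point (IRLS-style) minimum density power
    divergence fit of ONE multivariate normal. Illustrative sketch only."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    # Start from the ordinary (non-robust) sample estimates.
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False)
    # Correction term from the normal-model estimating equation for Sigma;
    # it vanishes at beta = 0, where the update reduces to the usual MLE.
    # Assumes the weight sum stays above this term (true for moderate beta).
    corr = n * beta * (1.0 + beta) ** (-(p / 2.0 + 1.0))
    for _ in range(n_iter):
        diff = X - mu
        # Squared Mahalanobis distance of every row to the current fit.
        d2 = np.einsum("ij,ij->i", diff @ np.linalg.inv(Sigma), diff)
        # Weight proportional to (fitted density)^beta.
        w = np.exp(-0.5 * beta * d2)
        mu_new = w @ X / w.sum()
        diff = X - mu_new
        Sigma_new = (diff * w[:, None]).T @ diff / (w.sum() - corr)
        shift = np.linalg.norm(mu_new - mu) + np.linalg.norm(Sigma_new - Sigma)
        mu, Sigma = mu_new, Sigma_new
        if shift < tol:
            break
    return mu, Sigma, w
```

Calling `mu, Sigma, w = dpd_normal_fit(X)` on an `(n, p)` array returns the robust location, the dispersion matrix, and the final weights; unusually small weights can serve as a simple exploratory flag for anomalous observations, in the spirit of the anomaly detection step mentioned above.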