Paper title
A Family of Mixture Models for Biclustering
Paper authors
Paper abstract
Biclustering simultaneously clusters the observations and the variables of a data set when no group structure is known \textit{a priori}. It is increasingly used in bioinformatics, text analytics, and other fields. Previously, biclustering was introduced in a model-based clustering framework by utilizing a structure similar to a mixture of factor analyzers. In such models, the observed variables $\mathbf{X}$ are modelled using a latent variable $\mathbf{U}$ that is assumed to follow $N(\mathbf{0}, \mathbf{I})$. Clustering of the variables is introduced by constraining the entries of the factor loading matrix to be 0 or 1, which results in block-diagonal covariance matrices. However, this approach is overly restrictive because the off-diagonal elements in the blocks of the covariance matrices can only be 1, which can lead to an unsatisfactory model fit on complex data. Here, the latent variable $\mathbf{U}$ is instead assumed to follow $N(\mathbf{0}, \mathbf{T})$, where $\mathbf{T}$ is a diagonal matrix. This ensures that the off-diagonal terms in the block matrices within the covariance matrices are non-zero and not restricted to be 1, which leads to a superior model fit on complex data. A family of models is developed by imposing constraints on the components of the covariance matrix. For parameter estimation, an alternating expectation-conditional maximization (AECM) algorithm is used. Finally, the proposed method is illustrated using simulated and real datasets.
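The effect of replacing $N(\mathbf{0}, \mathbf{I})$ with $N(\mathbf{0}, \mathbf{T})$ can be illustrated with a small numerical sketch. It assumes the usual factor-analyzer form $\mathbf{X} = \boldsymbol{\mu} + \mathbf{B}\mathbf{U} + \boldsymbol{\varepsilon}$ with a binary loading matrix $\mathbf{B}$ and diagonal error covariance $\mathbf{D}$, so that the implied covariance is $\mathbf{B}\mathbf{T}\mathbf{B}^\top + \mathbf{D}$. The symbols $\mathbf{B}$ and $\mathbf{D}$, the function name `block_covariance`, and all numerical values are illustrative assumptions that do not come from the abstract.

```python
import numpy as np

def block_covariance(B, T, D):
    """Covariance implied by X = mu + B U + eps with U ~ N(0, T), eps ~ N(0, D).

    B: (p, q) binary loading matrix, one 1 per row (column-cluster membership).
    T: (q, q) covariance of the latent variable U.
    D: (p, p) diagonal error covariance.
    """
    return B @ T @ B.T + D

# Hypothetical example: 5 variables split into 2 column clusters (3 + 2).
B = np.array([[1, 0],
              [1, 0],
              [1, 0],
              [0, 1],
              [0, 1]], dtype=float)
D = np.diag([0.5, 0.4, 0.3, 0.2, 0.1])

# U ~ N(0, I): every within-block off-diagonal entry is forced to be 1.
sigma_identity = block_covariance(B, np.eye(2), D)

# U ~ N(0, T) with diagonal T: within-block off-diagonal entries equal the
# corresponding diagonal element of T, so each block gets its own value.
sigma_diag = block_covariance(B, np.diag([2.5, 0.7]), D)

print(sigma_identity)
print(sigma_diag)
```

In both cases the entries linking variables from different column clusters remain 0, so the covariance stays block diagonal; only the within-block off-diagonal values change.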