Paper title
Scalable Bigraphical Lasso: Two-way Sparse Network Inference for Count Data
Paper authors
Abstract
Classically, statistical datasets have more data points than features ($n > p$). The standard model of classical statistics caters for the case where data points are considered conditionally independent given the parameters. However, for $n \approx p$ or $p > n$, such models are poorly determined. Kalaitzis et al. (2013) introduced the Bigraphical Lasso, an estimator for sparse precision matrices based on the Cartesian product of graphs. Unfortunately, the original Bigraphical Lasso algorithm is not applicable when $p$ and $n$ are large, because of its memory requirements. We exploit the eigenvalue decomposition of the Cartesian product graph to present a more efficient version of the algorithm, which reduces memory requirements from $O(n^2p^2)$ to $O(n^2 + p^2)$. Many datasets in application fields such as biology, medicine and social science consist of count data, for which Gaussian-based models are not applicable. Our multi-way network inference approach can be used for discrete data. Our methodology accounts for dependencies across both instances and features, reduces the computational complexity for high-dimensional data, and handles both discrete and continuous data. Numerical studies on both synthetic and real datasets showcase the performance of our method.
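The memory saving mentioned in the abstract can be illustrated with a small NumPy sketch. It is not the authors' algorithm. It assumes the precision matrix has the standard Kronecker-sum form $\Omega = \Psi \otimes I_p + I_n \otimes \Theta$ associated with a Cartesian product of graphs, where $\Psi$ ($n \times n$) and $\Theta$ ($p \times p$) are symmetric. Under that assumption the eigenvalues of $\Omega$ are all pairwise sums $\lambda_i + \mu_j$ of the eigenvalues of $\Psi$ and $\Theta$, so a quantity such as $\log\det\Omega$ can be computed from the two small factors without forming the $np \times np$ matrix. The names `random_spd` and `kron_sum_logdet` are hypothetical helpers introduced only for this illustration.

```python
import numpy as np

def random_spd(d, rng):
    """Hypothetical helper: a random symmetric positive definite d x d matrix."""
    a = rng.standard_normal((d, d))
    return a @ a.T + d * np.eye(d)

def kron_sum_logdet(psi, theta):
    """log det(psi (x) I_p + I_n (x) theta), using only the two small spectra.

    Assumes psi and theta are symmetric and every pairwise eigenvalue sum
    is positive, so the Kronecker sum is positive definite.
    """
    lam = np.linalg.eigvalsh(psi)    # n eigenvalues of psi
    mu = np.linalg.eigvalsh(theta)   # p eigenvalues of theta
    # The n x p array of sums lam_i + mu_j holds the full spectrum of omega.
    return np.log(lam[:, None] + mu[None, :]).sum()

rng = np.random.default_rng(0)
n, p = 6, 4
psi, theta = random_spd(n, rng), random_spd(p, rng)

# Dense reference, built only to check the factored computation.
# This is the (n*p) x (n*p) object that becomes infeasible for large n and p.
omega = np.kron(psi, np.eye(p)) + np.kron(np.eye(n), theta)
sign, dense_logdet = np.linalg.slogdet(omega)

factored_logdet = kron_sum_logdet(psi, theta)
print(np.isclose(dense_logdet, factored_logdet))
```

The factored route stores $\Psi$, $\Theta$ and an $n \times p$ array of eigenvalue sums, which stays within $O(n^2 + p^2)$ because $np \le (n^2 + p^2)/2$, whereas the dense matrix needs $O(n^2p^2)$ entries.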