Integrative Learning of Structured High-Dimensional Data from Multiple Datasets

Paper Title

Integrative Learning of Structured High-Dimensional Data from Multiple Datasets

Authors

Chang, Changgee; Dai, Zongyu; Oh, Jihwan; Long, Qi

Abstract

Integrative learning of multiple datasets has the potential to mitigate the challenge of small $n$ and large $p$ that is often encountered in the analysis of big biomedical data such as genomics data. Detection of weak yet important signals can be enhanced by jointly selecting features for all datasets. However, the set of important features may not always be the same across all datasets. Although some existing integrative learning methods allow a heterogeneous sparsity structure, in which a subset of datasets can have zero coefficients for some selected features, they tend to yield reduced efficiency, reinstating the problem of losing weak important signals. We propose a new integrative learning approach which can not only aggregate important signals well under a homogeneous sparsity structure, but also substantially alleviate the problem of losing weak important signals under a heterogeneous sparsity structure. Our approach exploits an a priori known graphical structure of the features and encourages joint selection of features that are connected in the graph. Integrating such prior information over multiple datasets enhances power while also accounting for heterogeneity across datasets. Theoretical properties of the proposed method are investigated. We also demonstrate the limitations of existing approaches and the superiority of our method using a simulation study and an analysis of gene expression data from ADNI.
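To make the terminology in the abstract concrete, the following is a minimal sketch that labels features as homogeneous, heterogeneous, or unselected from a coefficient table, and lists the graph edges whose two endpoints are both selected. It only illustrates the vocabulary; it is not the proposed method, whose estimator the abstract does not specify. The function names `sparsity_pattern` and `jointly_selected_edges`, the toy coefficients, and the toy edge list are all hypothetical.

```python
from typing import List, Tuple


def sparsity_pattern(coefs: List[List[float]], tol: float = 1e-12) -> List[str]:
    """Label each feature (row) by its support across datasets (columns).

    Assumption for illustration: coefs[j][m] is the coefficient of
    feature j in dataset m, and |value| <= tol counts as zero.
    """
    labels = []
    for row in coefs:
        nonzero = sum(abs(b) > tol for b in row)
        if nonzero == 0:
            labels.append("unselected")      # zero in every dataset
        elif nonzero == len(row):
            labels.append("homogeneous")     # nonzero in every dataset
        else:
            labels.append("heterogeneous")   # nonzero in only some datasets
    return labels


def jointly_selected_edges(
    labels: List[str], edges: List[Tuple[int, int]]
) -> List[Tuple[int, int]]:
    """Return the graph edges whose two endpoint features are both selected."""
    return [
        (j, k)
        for j, k in edges
        if labels[j] != "unselected" and labels[k] != "unselected"
    ]


# Toy example: 4 features, 3 datasets (values are made up).
coefs = [
    [0.8, 0.7, 0.9],  # strong signal in all datasets
    [0.2, 0.0, 0.3],  # weak signal, absent in dataset 2
    [0.0, 0.0, 0.0],  # never selected
    [0.1, 0.1, 0.0],  # weak signal, absent in dataset 3
]
edges = [(0, 1), (1, 2), (1, 3)]  # hypothetical prior feature graph

labels = sparsity_pattern(coefs)
selected_edges = jointly_selected_edges(labels, edges)
print(labels)
print(selected_edges)
```

In this toy table, features 1 and 3 are the kind of weak, heterogeneous signals the abstract is concerned with; a prior graph linking them to other selected features is the type of information the abstract says the approach exploits.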
