Subsampling Winner Algorithm for Feature Selection in Large Regression Data

Paper title

Subsampling Winner Algorithm for Feature Selection in Large Regression Data

Authors

Yiying Fan, Jiayang Sun

Abstract

Feature selection from a large number of covariates (aka features) in a regression analysis remains a challenge in data science, especially in terms of its potential of scaling to ever-enlarging data and finding a group of scientifically meaningful features. For example, to develop new, responsive drug targets for ovarian cancer, the actual false discovery rate (FDR) of a practical feature selection procedure must also match the target FDR. The popular approach to feature selection, when true features are sparse, is to use a penalized likelihood or a shrinkage estimation, such as a LASSO, SCAD, Elastic Net, or MCP procedure (call them benchmark procedures). We present a different approach using a new subsampling method, called the Subsampling Winner algorithm (SWA). The central idea of SWA is analogous to that used for the selection of US national merit scholars. SWA uses a "base procedure" to analyze each of the subsamples, computes the scores of all features according to the performance of each feature from all subsample analyses, obtains the "semifinalist" based on the resulting scores, and then determines the "finalists," i.e., the most important features. Due to its subsampling nature, SWA can scale to data of any dimension in principle. The SWA also has the best-controlled actual FDR in comparison with the benchmark procedures and the randomForest, while having a competitive true-feature discovery rate. We also suggest practical add-on strategies to SWA with or without a penalized benchmark procedure to further assure the chance of "true" discovery. Our application of SWA to the ovarian serous cystadenocarcinoma specimens from the Broad Institute revealed functionally important genes and pathways, which we verified by additional genomics tools. This second-stage investigation is essential in the current discussion of the proper use of P-values.
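The subsample, score, semifinalist, finalist loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: the abstract does not specify the base procedure, the scoring rule, or the cutoffs. Here a subsample is taken to be a random subset of feature columns, the base procedure is ordinary least squares, a feature's score is its average |t|-statistic over the subsamples containing it, and finalists are the semifinalists whose |t| in a joint fit exceeds a fixed cutoff. The names `abs_t_stats` and `swa_sketch` and the parameters `s`, `m`, `q`, and `t_cut` are hypothetical.

```python
import numpy as np

def abs_t_stats(X, y):
    """Return |t|-statistics of the non-intercept OLS coefficients."""
    n, k = X.shape
    Z = np.column_stack([np.ones(n), X])           # add an intercept column
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)   # least-squares fit
    resid = y - Z @ beta
    sigma2 = resid @ resid / (n - k - 1)           # residual variance estimate
    se = np.sqrt(sigma2 * np.diag(np.linalg.pinv(Z.T @ Z)))
    return np.abs(beta[1:]) / se[1:]

def swa_sketch(X, y, s=10, m=500, q=10, t_cut=4.0, seed=0):
    """Illustrative subsample-and-score selection (all defaults are assumptions).

    s: features per subsample, m: number of subsamples,
    q: number of semifinalists, t_cut: |t| cutoff for finalists.
    Requires more observations than max(s, q) + 1.
    """
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    total = np.zeros(p)   # accumulated |t| per feature
    hits = np.zeros(p)    # how often each feature was drawn
    for _ in range(m):
        idx = rng.choice(p, size=s, replace=False)  # random feature subsample
        total[idx] += abs_t_stats(X[:, idx], y)     # base procedure on subsample
        hits[idx] += 1
    scores = total / np.maximum(hits, 1)            # average score per feature
    semi = np.argsort(scores)[::-1][:q]             # semifinalists: top-q scores
    final_t = abs_t_stats(X[:, semi], y)            # joint refit on semifinalists
    finalists = sorted(int(j) for j in semi[final_t > t_cut])
    return finalists, scores

# Synthetic check: 50 features, only the first three carry signal.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))
y = 3.0 * X[:, :3].sum(axis=1) + rng.standard_normal(200)
finalists, scores = swa_sketch(X, y)
print(finalists)
```

Because each fit involves only `s` (or `q`) columns, the cost per fit does not grow with the total number of features, which is the scaling property the abstract attributes to subsampling.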
