Optimal Covariate Weighting Increases Discoveries in High-throughput Biology

Paper Title

Optimal Covariate Weighting Increases Discoveries in High-throughput Biology

Authors

Hasan, Mohamad; Schliekelman, Paul

Abstract

The large-scale multiple testing inherent to high-throughput biological data necessitates very high statistical stringency, and thus true effects in data are difficult to detect unless they have high effect sizes. One promising approach for reducing the multiple testing burden is to use independent information to prioritize the features most likely to be true effects. However, using the independent data effectively is challenging and often does not lead to substantial gains in power. Current state-of-the-art methods sort features into groups by the independent information and calculate weights for each group. However, when true effects are weak and rare (the typical situation for high-throughput biological studies), all groups will contain many null tests, so their weights are diluted and performance suffers. We introduce Covariate Rank Weighting (CRW), a method for calculating approximate optimal weights conditioned on the ranking of tests by an external covariate. This approach uses the probabilistic relationship between covariate ranking and test effect size to calculate individual weights for each test that are more informative than group weights and are not diluted by null effects. We show how this relationship can be calculated theoretically for normally distributed covariates. It can be estimated empirically in other cases. We show via simulations and applications to data that this method outperforms existing methods by as much as 10-fold in the rare/low effect size scenario common to biological data and has at least comparable performance in all scenarios.
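
To make the idea of per-test weights concrete, the following is a minimal sketch of generic weighted multiple testing, not the CRW method itself. It assumes a weighted Bonferroni rule (reject test i when p_i <= alpha * w_i / m, with weights averaging 1) and a placeholder power-law decay in covariate rank. CRW, as described in the abstract, instead derives its weights from the probabilistic relationship between covariate rank and effect size. The function names, the `decay` parameter, and the toy data are all hypothetical.

```python
import numpy as np


def rank_based_weights(covariate, decay=0.5):
    """Hypothetical per-test weights that shrink with covariate rank.

    Rank 1 goes to the largest covariate value. The power-law form is a
    placeholder assumption, not the weight function derived in the paper.
    Weights are rescaled so that they sum to the number of tests (mean 1).
    """
    covariate = np.asarray(covariate, dtype=float)
    m = covariate.size
    order = np.argsort(-covariate)          # indices from largest to smallest
    ranks = np.empty(m, dtype=int)
    ranks[order] = np.arange(1, m + 1)      # rank of each test
    raw = ranks.astype(float) ** (-decay)   # larger weight for better ranks
    return raw * m / raw.sum()


def weighted_bonferroni(pvalues, weights, alpha=0.05):
    """Reject test i when p_i <= alpha * w_i / m (weights must average 1)."""
    pvalues = np.asarray(pvalues, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return pvalues <= alpha * weights / pvalues.size


# Toy data (made up for illustration): the first test has a borderline
# p-value and the most favourable covariate.
pvalues = np.array([0.015, 0.03, 0.2, 0.6])
covariate = np.array([3.0, 0.5, 1.0, -1.0])

weights = rank_based_weights(covariate)
rejected_weighted = weighted_bonferroni(pvalues, weights)
rejected_unweighted = weighted_bonferroni(pvalues, np.ones_like(pvalues))

print("weights:", np.round(weights, 3))
print("weighted rejections:", rejected_weighted)
print("unweighted rejections:", rejected_unweighted)
```

In this toy case the first test falls just short of the unweighted threshold but passes once its favourable covariate rank raises its weight. The weights must be chosen independently of the p-values under the null for the error guarantee of the weighted rule to hold.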
