Paper title

Model selection and robust inference of mutational signatures using Negative Binomial non-negative matrix factorization

Authors

Marta Pelizzola, Ragnhild Laursen, Asger Hobolth

Abstract

The spectrum of mutations in a collection of cancer genomes can be described by a mixture of a few mutational signatures. The mutational signatures can be found using non-negative matrix factorization (NMF). To extract the mutational signatures, we have to assume a distribution for the observed mutational counts and a number of mutational signatures. In most applications, the mutational counts are assumed to be Poisson distributed, and the rank is chosen by comparing the fit of several models with the same underlying distribution and different values for the rank using classical model selection procedures. However, the counts are often overdispersed, and thus the Negative Binomial distribution is more appropriate. We propose a Negative Binomial NMF with a patient-specific dispersion parameter to capture the variation across patients. We also introduce a novel model selection procedure inspired by cross-validation to determine the number of signatures. Using simulations, we study the influence of the distributional assumption on our method together with other classical model selection procedures, and we show that our model selection procedure is more robust at determining the correct number of signatures under model misspecification. We also show that our model selection procedure is more accurate than state-of-the-art methods for finding the true number of signatures. Other methods substantially overestimate the number of signatures when overdispersion is present. We apply our proposed analysis to a wide range of simulated data and to two real data sets from breast and prostate cancer patients. The code for our model selection procedure and Negative Binomial NMF is available in the R package SigMoS and can be found at https://github.com/MartaPelizzola/SigMoS.
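
The overdispersion argument in the abstract can be illustrated with a small simulation. The sketch below is not taken from the SigMoS package. It assumes a toy setting with two hypothetical signatures, six mutation types, identical exposures for every patient, and a common dispersion value `alpha` (the paper allows one value per patient). Under these assumptions it draws counts with mean `exposures @ signatures` once from a Poisson model and once from a Negative Binomial model, written as a Gamma-Poisson mixture with variance `mean + mean**2 / alpha`, and compares the variance-to-mean ratio of the two.

```python
import numpy as np

rng = np.random.default_rng(0)

n_patients = 2000  # toy size chosen so sample variances are stable

# Hypothetical signatures: each row is a probability vector over 6 mutation types.
signatures = np.array([
    [0.40, 0.30, 0.10, 0.10, 0.05, 0.05],
    [0.05, 0.05, 0.10, 0.10, 0.30, 0.40],
])

# Assumption: every patient has the same exposures, so the spread across
# patients reflects sampling noise only.
exposures = np.tile([300.0, 700.0], (n_patients, 1))
mean = exposures @ signatures  # shape (n_patients, 6)

# Poisson model: variance equals the mean.
poisson_counts = rng.poisson(mean)

# Negative Binomial model as a Gamma-Poisson mixture.
# alpha[n] is the dispersion of patient n; here all values are equal.
alpha = np.full(n_patients, 10.0)
rate = rng.gamma(shape=alpha[:, None], scale=mean / alpha[:, None])
nb_counts = rng.poisson(rate)


def dispersion_index(counts):
    """Variance-to-mean ratio per mutation type, computed across patients."""
    return counts.var(axis=0, ddof=1) / counts.mean(axis=0)


poisson_ratio = dispersion_index(poisson_counts)
nb_ratio = dispersion_index(nb_counts)

print("Poisson variance/mean:          ", np.round(poisson_ratio, 2))
print("Negative Binomial variance/mean:", np.round(nb_ratio, 2))
```

In this toy setting the Poisson ratios stay close to 1, while the Negative Binomial ratios are close to `1 + mean / alpha`. A Poisson NMF fitted to counts like the second set has no way to represent that extra variance except through additional signatures, which is consistent with the overestimation the abstract reports.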
