Paper title
Missing data interpolation in integrative multi-cohort analysis with disparate covariate information
Paper authors
Paper abstract
Integrative analysis of datasets generated by multiple cohorts is a widely used approach for increasing sample size, precision of population estimators, and generalizability of analysis results in epidemiological studies. However, individual cohort datasets often lack some of the variables of interest for an integrative analysis, because those variables were not collected as part of the original study. Such cohort-level missingness poses methodological challenges for integrative analysis, since missing variables have traditionally been either (1) removed from the data to allow complete case analysis, or (2) completed by missing data interpolation techniques using data with the same covariate distribution from other studies. In most integrative-analysis studies, neither approach is optimal: the first leads to losing the majority of study covariates, and the second makes it difficult to specify which cohorts follow the same distribution. We propose a novel approach to identify the studies with the same distributions that could be used for completing the cohort-level missing information. Our methodology relies on (1) identifying sub-groups of cohorts with similar covariate distributions using cohort identity random forest prediction models followed by clustering; and then (2) applying a recursive pairwise distribution test for high-dimensional data to these sub-groups. Extensive simulation studies show that cohorts with the same distribution are correctly grouped together in almost all simulation settings. Applying our methods to two ECHO-wide Cohort Studies reveals that the cohorts grouped together reflect the similarities in study design. The methods are implemented in the R software package relate.
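The two-step idea in the abstract can be illustrated with a toy sketch. It is not the authors' implementation and does not use the relate package. To stay dependency-free, it swaps in a leave-one-out nearest-neighbour classifier for the cohort identity random forest, a simple confusion-rate threshold for the clustering step, and a permutation test on the gap between sample means for the high-dimensional distribution test; the recursive part of the testing procedure is not reproduced. The simulated cohorts, the 0.2 threshold, and all function names are assumptions made for illustration.

```python
import random
from itertools import combinations

random.seed(0)

def simulate(n, shift, dim=3):
    """Draw n rows of independent Gaussian covariates centred at `shift`."""
    return [[random.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n)]

# Hypothetical toy cohorts: A and B share a distribution, C is shifted.
cohorts = {"A": simulate(40, 0.0), "B": simulate(40, 0.0), "C": simulate(40, 10.0)}

def sq_dist(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def confusion_rates(cohorts):
    """Step 1a: predict cohort identity (1-NN stand-in for a random forest).

    For each cohort pair, return the share of their rows whose leave-one-out
    nearest neighbour belongs to the other cohort of the pair.
    """
    rows = [(name, x) for name, data in cohorts.items() for x in data]
    confused = {pair: 0 for pair in combinations(sorted(cohorts), 2)}
    for i, (name, x) in enumerate(rows):
        pred = min((r for j, r in enumerate(rows) if j != i),
                   key=lambda r: sq_dist(x, r[1]))[0]
        if pred != name:
            confused[tuple(sorted((name, pred)))] += 1
    return {pair: count / (len(cohorts[pair[0]]) + len(cohorts[pair[1]]))
            for pair, count in confused.items()}

def cluster(names, rates, threshold=0.2):
    """Step 1b: merge cohorts that the classifier frequently confuses."""
    group = {n: {n} for n in names}
    for (a, b), rate in rates.items():
        if rate > threshold and group[a] is not group[b]:
            merged = group[a] | group[b]
            for n in merged:
                group[n] = merged
    return sorted({tuple(sorted(g)) for g in group.values()})

def mean_gap(x, y):
    """Squared Euclidean distance between the two sample mean vectors."""
    mx = [sum(col) / len(x) for col in zip(*x)]
    my = [sum(col) / len(y) for col in zip(*y)]
    return sq_dist(mx, my)

def permutation_pvalue(x, y, n_perm=200):
    """Step 2 stand-in: permutation test, sensitive to mean shifts only."""
    observed = mean_gap(x, y)
    pooled = x + y
    hits = 0
    for _ in range(n_perm):
        random.shuffle(pooled)
        if mean_gap(pooled[:len(x)], pooled[len(x):]) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

rates = confusion_rates(cohorts)
groups = cluster(cohorts, rates)
# Test every pair inside each candidate sub-group.
p_values = {pair: permutation_pvalue(cohorts[pair[0]], cohorts[pair[1]])
            for g in groups for pair in combinations(g, 2)}
# For contrast, test a pair that step 1 kept apart.
p_ac = permutation_pvalue(cohorts["A"], cohorts["C"])
print(groups, p_values, p_ac)
```

In this toy setting the classifier cannot tell A from B, so they land in one candidate sub-group, while C is kept apart. A pair would only be used to complete each other's missing variables if the within-group test also fails to reject equality of distributions.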