Paper Title
Nonparametric Causal Feature Selection for Spatiotemporal Risk Mapping of Malaria Incidence in Madagascar
Paper Authors
Paper Abstract
Modern disease mapping draws upon a wealth of high-resolution spatial data products reflecting environmental and/or socioeconomic factors as covariates, or 'features', within a geostatistical framework to improve predictions of disease risk. Feature selection is an important step in building these models, helping to reduce overfitting and computational complexity, and to improve model interpretability. Selecting only features that have a causal relationship with the response variable could potentially improve predictions and generalisability, but identifying these causal features from non-interventional, spatiotemporal data is a challenging problem. Here we examine the performance of a causal feature selection procedure with regard to estimating malaria incidence in Madagascar. The studied procedure designed for this task combines the PC algorithm with spatiotemporal prewhitening and kernel-based independence tests extended to accommodate aggregated data. This case study reveals a clear advantage for causal feature selection in terms of the out-of-sample predictive accuracy in a forward temporal estimation task, but not in a spatiotemporal interpolation task, in comparison with thresholded spike-and-slab, for both linear and non-linear regression models. Compared to no feature selection, causal feature selection was most beneficial in settings wherein the volume of available data was low relative to the model complexity.
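For readers unfamiliar with kernel-based independence tests, the following is a minimal sketch of one common variant: the Hilbert-Schmidt Independence Criterion (HSIC) with a permutation p-value. It is not the procedure studied in the paper, which additionally involves the PC algorithm, spatiotemporal prewhitening, and an extension to aggregated data; this sketch assumes plain i.i.d. samples. The function names (`rbf_gram`, `hsic_statistic`, `hsic_permutation_test`), the Gaussian kernel, and the median-heuristic bandwidth are illustrative assumptions.

```python
import numpy as np


def rbf_gram(x, bandwidth=None):
    """Gaussian (RBF) Gram matrix; bandwidth defaults to a median heuristic (assumption)."""
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)
    sq = np.sum(x ** 2, axis=1)
    # Pairwise squared Euclidean distances, clipped at zero for numerical safety.
    d2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * (x @ x.T), 0.0)
    if bandwidth is None:
        positive = d2[d2 > 0]
        med = np.median(positive) if positive.size else 1.0
        bandwidth = np.sqrt(0.5 * med)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))


def hsic_statistic(K, L):
    """Biased HSIC estimate: trace(K H L H) / n^2, with H the centring matrix."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    Kc = H @ K @ H
    # trace(Kc @ L) equals the elementwise sum because L is symmetric.
    return np.sum(Kc * L) / n ** 2


def hsic_permutation_test(x, y, n_perm=200, seed=0):
    """Return (statistic, p_value) for the null hypothesis that x and y are independent."""
    rng = np.random.default_rng(seed)
    K, L = rbf_gram(x), rbf_gram(y)
    stat = hsic_statistic(K, L)
    null = np.empty(n_perm)
    for b in range(n_perm):
        # Permuting the sample order of y breaks any dependence on x.
        p = rng.permutation(L.shape[0])
        null[b] = hsic_statistic(K, L[np.ix_(p, p)])
    p_value = (1.0 + np.sum(null >= stat)) / (1.0 + n_perm)
    return stat, p_value


# Synthetic illustration only: y depends on x non-linearly, z is unrelated noise.
rng = np.random.default_rng(42)
x = rng.normal(size=100)
y = x ** 2 + 0.1 * rng.normal(size=100)
z = rng.normal(size=100)
stat_xy, p_xy = hsic_permutation_test(x, y)
stat_xz, p_xz = hsic_permutation_test(x, z)
print(f"x vs y: HSIC={stat_xy:.4f}, p={p_xy:.3f}")
print(f"x vs z: HSIC={stat_xz:.4f}, p={p_xz:.3f}")
```

In a constraint-based search such as the PC algorithm, a test of this kind (in a conditional form) decides whether an edge between two variables is retained; a feature would then be kept if it remains adjacent to the response. The data above are synthetic and carry no relation to the Madagascar case study.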