Paper title
Causal discovery for observational sciences using supervised machine learning
Paper authors
Abstract
Causal inference can estimate causal effects, but unless data are collected experimentally, statistical analyses must rely on pre-specified causal models. Causal discovery algorithms are empirical methods for constructing such causal models from data. Several asymptotically correct methods already exist, but they generally struggle on smaller samples. Moreover, most methods focus on very sparse causal models, which may not always be a realistic representation of real-life data generating mechanisms. Finally, while causal relationships suggested by the methods often hold true, their claims about causal non-relatedness have high error rates. This non-conservative error tradeoff is not ideal for observational sciences, where the resulting model is directly used to inform causal inference: a causal model with many missing causal relations entails overly strong assumptions and may lead to biased effect estimates. We propose a new causal discovery method that addresses these three shortcomings: Supervised learning discovery (SLdisco). SLdisco uses supervised machine learning to obtain a mapping from observational data to equivalence classes of causal models. We evaluate SLdisco in a large simulation study based on Gaussian data, considering several choices of model size and sample size. We find that SLdisco is more conservative, only moderately less informative, and less sensitive to sample size than existing procedures. We also provide an application to real epidemiological data. We use random subsampling to investigate real data performance on small samples and again find that SLdisco is less sensitive to sample size and hence seems to better utilize the information available in small datasets.
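The supervised setup described in the abstract can be pictured as a generator of labelled training examples: simulate a random causal model, draw Gaussian data from it, and pair a summary of the data with the known structure. The sketch below is illustrative only and is not the authors' implementation. The function names, the edge-weight range, the use of the correlation matrix as input features, and the use of the undirected skeleton as the label (a simplification of the equivalence class of causal models) are all assumptions made for this example.

```python
import numpy as np


def random_dag(p, edge_prob, rng):
    """Weighted adjacency matrix B of a random DAG; B[i, j] != 0 means j -> i.

    Hypothetical helper: the weight range and edge probability are assumptions.
    """
    # A strictly lower-triangular pattern guarantees acyclicity.
    mask = np.tril(rng.random((p, p)) < edge_prob, k=-1)
    weights = rng.uniform(0.5, 1.5, size=(p, p)) * rng.choice([-1.0, 1.0], size=(p, p))
    B = mask * weights
    # Relabel variables so the causal order is not visible in the variable order.
    perm = rng.permutation(p)
    return B[np.ix_(perm, perm)]


def simulate(B, n, rng):
    """Draw n samples from the linear Gaussian model X = B X + noise."""
    p = B.shape[0]
    noise = rng.standard_normal((n, p))
    # Solving X = B X + noise gives X = (I - B)^{-1} noise; samples are rows.
    return noise @ np.linalg.inv(np.eye(p) - B).T


def training_pair(p, n, edge_prob, rng):
    """One (features, label) example for a supervised learner.

    Assumption: features are the sample correlation matrix, and the label is
    the undirected skeleton of the true DAG rather than its full equivalence class.
    """
    B = random_dag(p, edge_prob, rng)
    X = simulate(B, n, rng)
    features = np.corrcoef(X, rowvar=False)
    skeleton = ((B != 0) | (B.T != 0)).astype(int)
    return features, skeleton
```

Calling `training_pair(5, 200, 0.4, np.random.default_rng(0))` returns a 5 x 5 correlation matrix and a 5 x 5 symmetric 0/1 matrix. Many such pairs, generated across different model sizes and sample sizes, could then be passed to any supervised learner that maps a matrix to a matrix.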