Paper title
Collaborative causal inference on distributed data
Authors
Abstract
In recent years, the development of technologies for causal inference with privacy preservation of distributed data has gained considerable attention. Many existing methods for distributed data focus on resolving the lack of subjects (samples) and can only reduce random errors in estimating treatment effects. In this study, we propose a data collaboration quasi-experiment (DC-QE) that resolves the lack of both subjects and covariates, reducing random errors and biases in the estimation. Our method involves constructing dimensionality-reduced intermediate representations from private data from local parties, sharing intermediate representations instead of private data for privacy preservation, estimating propensity scores from the shared intermediate representations, and finally, estimating the treatment effects from propensity scores. Through numerical experiments on both artificial and real-world data, we confirm that our method leads to better estimation results than individual analyses. While dimensionality reduction loses some information in the private data and causes performance degradation, we observe that sharing intermediate representations with many parties to resolve the lack of subjects and covariates sufficiently improves performance to overcome the degradation caused by dimensionality reduction. Although external validity is not necessarily guaranteed, our results suggest that DC-QE is a promising method. With the widespread use of our method, intermediate representations can be published as open data to help researchers find causalities and accumulate a knowledge base.
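The four steps named in the abstract (local dimensionality reduction, sharing of intermediate representations, propensity score estimation, treatment effect estimation) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the synthetic data, the two hypothetical parties that hold different covariates for the same subjects, PCA as the reduction, logistic regression for the propensity scores, and a normalized inverse probability weighting estimator. The sketch covers only the case of split covariates and omits any step that the full method may need to make representations from different parties comparable.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000  # number of subjects (hypothetical)

def make_block(n, rng):
    """Synthetic covariates for one party: 3 columns driven by 2 latent factors."""
    latent = rng.normal(size=(n, 2))
    loadings = rng.normal(size=(2, 3))
    return latent @ loadings + 0.1 * rng.normal(size=(n, 3)), latent

# Assumption: parties A and B hold different covariates for the same subjects.
X_a, f_a = make_block(n, rng)
X_b, f_b = make_block(n, rng)

# Treatment and outcome both depend on a confounder spread across both parties.
confounder = f_a[:, 0] + f_b[:, 0]
t = (rng.random(n) < 1.0 / (1.0 + np.exp(-confounder))).astype(float)
y = 2.0 * t + confounder + rng.normal(size=n)  # simulated treatment effect: 2.0

def reduce_locally(X, k):
    """Step 1: each party projects its private data onto its top-k principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T

def propensity_scores(Z, t, iters=25):
    """Step 3: logistic regression fitted by Newton's method on the shared representation."""
    Zb = np.column_stack([np.ones(len(Z)), Z])
    w = np.zeros(Zb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Zb @ w))
        grad = Zb.T @ (p - t)
        hess = Zb.T @ (Zb * (p * (1.0 - p))[:, None]) + 1e-8 * np.eye(Zb.shape[1])
        w -= np.linalg.solve(hess, grad)
    return 1.0 / (1.0 + np.exp(-Zb @ w))

def ipw_ate(y, t, e):
    """Step 4: normalized inverse probability weighting estimate of the average effect."""
    e = np.clip(e, 0.01, 0.99)  # clipping bounds are an assumption, for numerical stability
    w1 = t / e
    w0 = (1.0 - t) / (1.0 - e)
    return np.sum(w1 * y) / np.sum(w1) - np.sum(w0 * y) / np.sum(w0)

# Step 2: only the reduced representations leave each party, never X_a or X_b.
Z_shared = np.hstack([reduce_locally(X_a, 2), reduce_locally(X_b, 2)])

e_hat = propensity_scores(Z_shared, t)
ate_hat = ipw_ate(y, t, e_hat)
naive = y[t == 1].mean() - y[t == 0].mean()

print(f"naive difference in means: {naive:.3f}")
print(f"IPW estimate on shared representations: {ate_hat:.3f}")
```

Comparing the printed IPW estimate with the naive difference in means gives a rough sense of how much confounding the shared representations can adjust for in this toy setting. The result depends on how much of the confounding signal survives the reduction, which is the trade-off the abstract describes.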