Evaluation of the Synthetic Electronic Health Records

Paper title

Evaluation of the Synthetic Electronic Health Records

Authors

Muller, Emily; Zheng, Xu; Hayes, Jer

Abstract

Generative models have been found effective for data synthesis because they can capture complex underlying data distributions. The quality of data generated by these models is commonly evaluated by visual inspection for image datasets or by downstream analytical tasks for tabular datasets. These evaluation methods neither measure the implicit data distribution nor consider data privacy, and how to compare and rank different generative models remains an open question. Medical data can be sensitive, so it is important to address patients' privacy concerns while maintaining the utility of the synthetic dataset. Beyond utility evaluation, this work outlines two metrics, Similarity and Uniqueness, for sample-wise assessment of synthetic datasets. We demonstrate the proposed notions with several state-of-the-art generative models used to synthesise Cystic Fibrosis (CF) patients' electronic health records (EHRs), and observe that the proposed metrics are suitable for synthetic data evaluation and generative model comparison.
