Paper title
The Health Gym: Synthetic Health-Related Datasets for the Development of Reinforcement Learning Algorithms
Paper authors
Paper abstract
In recent years, the machine learning research community has benefited tremendously from the availability of openly accessible benchmark datasets. Clinical data are usually not openly available due to their highly confidential nature. This has hampered the development of reproducible and generalisable machine learning applications in health care. Here we introduce the Health Gym, a growing collection of highly realistic synthetic medical datasets that can be freely accessed to prototype, evaluate, and compare machine learning algorithms, with a specific focus on reinforcement learning. The three synthetic datasets described in this paper present patient cohorts with acute hypotension and sepsis in the intensive care unit, and people with human immunodeficiency virus (HIV) receiving antiretroviral therapy in ambulatory care. The datasets were created using a novel generative adversarial network (GAN). The distributions of variables, the correlations between variables, and the trends over time in the synthetic datasets mirror those in the real datasets. Furthermore, the risk of sensitive information disclosure associated with the public distribution of the synthetic datasets is estimated to be very low.