Does Deep Learning REALLY Outperform Non-deep Machine Learning for Clinical Prediction on Physiological Time Series?

Paper Title

Does Deep Learning REALLY Outperform Non-deep Machine Learning for Clinical Prediction on Physiological Time Series?

Authors

Liao, Ke; Wang, Wei; Elibol, Armagan; Meng, Lingzhong; Zhao, Xu; Chong, Nak Young

Abstract

Machine learning has been widely used in healthcare applications to approximate complex models for clinical diagnosis, prognosis, and treatment. Although deep learning has an outstanding ability to extract information from time series, its true capabilities on sparse, irregularly sampled, multivariate, and imbalanced physiological data have not yet been fully explored. In this paper, we systematically examine the performance of machine learning models on clinical prediction tasks based on electronic health records (EHR), especially physiological time series. We use the public PhysioNet 2019 Challenge dataset to predict sepsis outcomes in the ICU. Ten baseline machine learning models commonly used in the clinical prediction domain are compared: 3 deep learning methods and 7 non-deep learning methods. Nine evaluation metrics with specific clinical implications are used to assess model performance. In addition, we sub-sample the training dataset at different sizes and fit learning curves to investigate the impact of training dataset size on model performance. We also propose a general pre-processing method for physiological time-series data and use Dice loss to deal with the dataset imbalance problem. The results show that deep learning does outperform non-deep learning, but only under certain conditions: first, when evaluated with particular metrics (AUROC, AUPRC, sensitivity, and FNR) but not others; second, when the training dataset is large enough (estimated to be on the order of thousands).
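
For readers unfamiliar with two of the terms in the abstract, the following is a minimal sketch of a soft Dice loss for binary labels and of sensitivity and false negative rate (FNR). It is not taken from the paper: the function names, the smoothing constant `eps`, and the toy inputs are assumptions chosen for illustration, and the authors' exact formulation may differ.

```python
import numpy as np


def soft_dice_loss(probs, labels, eps=1.0):
    """Soft Dice loss: 1 - (2*sum(p*y) + eps) / (sum(p) + sum(y) + eps).

    `eps` is an assumed smoothing term that avoids division by zero;
    it is not a value reported in the paper.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    intersection = np.sum(probs * labels)
    denom = np.sum(probs) + np.sum(labels)
    return 1.0 - (2.0 * intersection + eps) / (denom + eps)


def sensitivity_and_fnr(preds, labels):
    """Return (sensitivity, FNR) for binary predictions and labels."""
    preds = np.asarray(preds).astype(bool)
    labels = np.asarray(labels).astype(bool)
    tp = np.sum(preds & labels)    # positives correctly flagged
    fn = np.sum(~preds & labels)   # positives that were missed
    if tp + fn == 0:
        # No positive cases: both quantities are undefined.
        return float("nan"), float("nan")
    sensitivity = tp / (tp + fn)
    return float(sensitivity), float(1.0 - sensitivity)
```

Because the Dice loss depends only on the overlap between predicted and true positives, the many true negatives in an imbalanced dataset do not dominate it, which is the usual motivation for using it on rare outcomes. FNR is simply one minus sensitivity, so the two metrics always move together.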
