Paper Title
Estimating Model Performance on External Samples from Their Limited Statistical Characteristics
Paper Authors
Paper Abstract
Methods that address data shift usually assume full access to multiple datasets. In healthcare, however, privacy regulations and commercial interests limit data availability, so researchers can typically study only a small number of datasets. In contrast, limited statistical characteristics of specific patient samples are much easier to share and may be available from previously published literature or focused collaborative efforts. Here, we propose a method that estimates model performance in external samples from their limited statistical characteristics. We search for weights on the internal sample that induce internal statistics similar to the external ones and that are closest to uniform. We then use model performance on the weighted internal sample as an estimate of the external performance. We evaluate the proposed algorithm on simulated data as well as on electronic medical record data for two risk models: one predicting complications in ulcerative colitis patients and one predicting stroke in women diagnosed with atrial fibrillation. In the vast majority of cases, the estimated external performance is much closer to the actual external performance than the internal performance is. Our proposed method may be an important building block in training robust models and detecting potential model failures in external environments.
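The reweighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not state which statistics are matched, how closeness to uniform is measured, or how the search is carried out, so everything below is an assumption made for illustration: the shared statistics are taken to be feature means, closeness to uniform is measured by KL divergence (which leads to exponentially tilted weights), and the fit uses plain gradient descent. The names `fit_weights` and `weighted_mean` and the synthetic data are hypothetical and are not taken from the paper.

```python
import math
import random


def fit_weights(internal_rows, external_means, max_iter=5000, tol=1e-9):
    """Weights over internal rows whose weighted feature means match
    external_means while staying close to uniform.

    Assumption (not from the paper): closeness to uniform is measured by KL
    divergence, which gives weights w_i proportional to exp(lam . z_i). The
    vector lam is fitted by gradient descent on the convex dual
    log(sum_i exp(lam . z_i)).
    """
    # Centre each row on the external means, so the target weighted mean is 0.
    z = [[x - m for x, m in zip(row, external_means)] for row in internal_rows]
    dim = len(external_means)
    # The dual's gradient is Lipschitz with constant <= max ||z_i||^2,
    # so a step of 1 / that bound never overshoots.
    step = 1.0 / max(max(sum(v * v for v in row) for row in z), 1e-12)
    lam = [0.0] * dim
    weights = [1.0 / len(z)] * len(z)
    for _ in range(max_iter):
        scores = [sum(l * v for l, v in zip(lam, row)) for row in z]
        top = max(scores)  # subtract the max for numerical stability
        unnorm = [math.exp(s - top) for s in scores]
        total = sum(unnorm)
        weights = [u / total for u in unnorm]
        # Gradient of the dual = weighted mean of the centred rows.
        grad = [sum(w * row[j] for w, row in zip(weights, z)) for j in range(dim)]
        if max(abs(g) for g in grad) < tol:
            break
        lam = [l - step * g for l, g in zip(lam, grad)]
    return weights


def weighted_mean(values, weights):
    """Weighted average of a per-row quantity (weights are assumed to sum to 1)."""
    return sum(w * v for w, v in zip(weights, values))


# Synthetic illustration: two features per row, and a per-row indicator of
# whether a hypothetical model's prediction was correct on that row.
random.seed(0)
rows = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(500)]
correct = [1.0 if row[0] < 0.5 else 0.0 for row in rows]

# Only these summary statistics of the external sample are assumed to be shared.
external_means = [0.3, -0.2]

weights = fit_weights(rows, external_means)
print("internal accuracy:", sum(correct) / len(correct))
print("estimated external accuracy:", weighted_mean(correct, weights))
```

Accuracy is used here only because it is a simple average over rows. Other metrics would need their own weighted versions, and matching statistics other than means would require extending the columns of `internal_rows` accordingly.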