Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

Paper Title

Using Undersampling with Ensemble Learning to Identify Factors Contributing to Preterm Birth

Authors

Dong, Shi; Feric, Zlatan; Li, Guangyu; Wu, Chieh; Gu, April Z.; Dy, Jennifer; Meeker, John; Padilla, Ingrid Y.; Cordero, Jose; Vega, Carmen Velez; Rosario, Zaira; Alshawabkeh, Akram; Kaeli, David

Abstract

In this paper, we propose Ensemble Learning models to identify factors contributing to preterm birth. Our work leverages a rich dataset collected by a NIEHS P42 Center that is trying to identify the dominant factors responsible for the high rate of premature births in northern Puerto Rico. We investigate analytical models that address two major challenges present in the dataset: 1) a significant amount of incomplete data, and 2) class imbalance. First, we leverage and compare two types of missing data imputation methods, 1) mean-based and 2) similarity-based, to increase the completeness of the dataset. Second, we propose a feature selection and evaluation model that uses undersampling with Ensemble Learning to address the class imbalance. We leverage and compare multiple Ensemble Feature selection methods, including Complete Linear Aggregation (CLA), Weighted Mean Aggregation (WMA), Feature Occurrence Frequency (OFA), and Classification Accuracy Based Aggregation (CAA). To further address the missing data present in each feature, we propose two novel methods: 1) Missing Data Rate and Accuracy Based Aggregation (MAA), and 2) Entropy and Accuracy Based Aggregation (EAA). Both proposed models balance the degree of data variance introduced by missing data handling during the feature selection process while maintaining model performance. Our results show a 42% improvement in sensitivity versus fallout over previous state-of-the-art methods.
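
The pipeline outlined in the abstract (impute missing values, undersample the majority class repeatedly, score features on each balanced subset, then aggregate the scores) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the standardized class-mean-difference scorer, and the plain averaging of scores are all assumptions chosen for brevity, and the plain average stands in for the paper's aggregation schemes (CLA, WMA, OFA, CAA, MAA, EAA), which the abstract does not define in detail.

```python
import numpy as np

def mean_impute(X):
    """Mean-based imputation: replace each NaN with its column's observed mean."""
    X = np.array(X, dtype=float)          # copy so the caller's data is untouched
    col_means = np.nanmean(X, axis=0)     # assumes no column is entirely NaN
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def undersample(X, y, rng):
    """Keep every minority-class row plus an equal-sized random draw of majority rows."""
    ones = np.flatnonzero(y == 1)
    zeros = np.flatnonzero(y == 0)
    minority, majority = (ones, zeros) if len(ones) <= len(zeros) else (zeros, ones)
    chosen = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, chosen])
    return X[idx], y[idx]

def feature_scores(X, y):
    """Hypothetical base scorer: absolute difference of class means over the feature's std."""
    diff = np.abs(X[y == 1].mean(axis=0) - X[y == 0].mean(axis=0))
    std = X.std(axis=0)
    return diff / np.where(std > 0, std, 1.0)

def ensemble_feature_ranking(X, y, n_rounds=25, seed=0):
    """Average the per-round scores over many balanced subsets and rank features."""
    rng = np.random.default_rng(seed)
    X = mean_impute(X)
    y = np.asarray(y)
    total = np.zeros(X.shape[1])
    for _ in range(n_rounds):
        Xb, yb = undersample(X, y, rng)
        total += feature_scores(Xb, yb)
    mean_score = total / n_rounds         # simple mean aggregation (an assumption)
    return np.argsort(-mean_score), mean_score
```

Calling `ensemble_feature_ranking(X, y)` on a feature matrix with `NaN` entries and binary labels returns feature indices ordered from most to least discriminative under this toy scorer. Swapping in a classifier-based scorer, or weighting each round by its accuracy or by the feature's missing-data rate, would move the sketch closer to the kinds of aggregation the abstract names.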
