A Novel Hybrid Sampling Framework for Imbalanced Learning

Paper Title

A Novel Hybrid Sampling Framework for Imbalanced Learning

Authors

Newaz, Asif; Haq, Farhan Shahriyar

Abstract

Class imbalance occurs frequently in classification tasks. Learning from imbalanced data poses a major challenge and has motivated a large body of research in this area. Data preprocessing with sampling techniques is a standard approach to dealing with the imbalance present in the data. Because standard classification algorithms do not perform well on imbalanced data, the dataset needs to be adequately balanced before training. This can be accomplished by oversampling the minority class or by undersampling the majority class. In this study, a novel hybrid sampling algorithm is proposed. To overcome the limitations of individual sampling techniques while ensuring the quality of the retained sampled dataset, a framework has been developed that combines three different sampling techniques. The Neighborhood Cleaning Rule is first applied to reduce the imbalance. Random undersampling is then strategically coupled with the SMOTE algorithm to obtain an optimal balance in the dataset. The proposed hybrid methodology, termed "SMOTE-RUS-NC", has been compared with other state-of-the-art sampling techniques. The strategy is further incorporated into an ensemble learning framework to obtain a more robust classification algorithm, termed "SRN-BRF". Rigorous experimentation has been conducted on 26 imbalanced datasets with varying degrees of imbalance. In virtually all datasets, the two proposed algorithms outperformed existing sampling strategies, in many cases by a substantial margin. In highly imbalanced datasets especially, where popular sampling techniques failed utterly, they achieved unparalleled performance. The superior results obtained demonstrate the efficacy of the proposed models and their potential to be powerful sampling algorithms in the imbalanced domain.
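
The three-stage pipeline described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation: the abstract does not specify the order in which random undersampling and SMOTE are applied, the sampling ratios, or the neighbour counts, so the function names, the `ratio=0.5` target, `k=3`, the undersample-then-oversample order, and the toy data below are all assumptions made for illustration. The cleaning step is also a simplified binary-class version of the Neighborhood Cleaning Rule.

```python
import math
import random

def knn(points, idx, k):
    """Indices of the k nearest neighbours of points[idx] (Euclidean distance)."""
    dists = [(math.dist(points[idx], p), j) for j, p in enumerate(points) if j != idx]
    return [j for _, j in sorted(dists)[:k]]

def neighbourhood_clean(X, y, minority, k=3):
    """Simplified cleaning step: remove majority points that look like noise."""
    drop = set()
    for i in range(len(X)):
        nbrs = knn(X, i, k)
        predicted_minority = sum(y[j] == minority for j in nbrs) > k / 2
        if y[i] != minority and predicted_minority:
            drop.add(i)  # majority point outvoted by minority neighbours
        elif y[i] == minority and not predicted_minority:
            # minority point outvoted: remove its majority neighbours instead
            drop.update(j for j in nbrs if y[j] != minority)
    keep = [i for i in range(len(X)) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]

def random_undersample(X, y, minority, ratio, rng):
    """Keep every minority point and at most n_minority / ratio majority points."""
    min_idx = [i for i, c in enumerate(y) if c == minority]
    maj_idx = [i for i, c in enumerate(y) if c != minority]
    n_keep = min(len(maj_idx), int(len(min_idx) / ratio))
    keep = sorted(min_idx + rng.sample(maj_idx, n_keep))
    return [X[i] for i in keep], [y[i] for i in keep]

def smote(X, y, minority, rng, k=3):
    """Add interpolated minority points until both classes have equal size.

    Assumes at least two minority points are present.
    """
    mins = [x for x, c in zip(X, y) if c == minority]
    n_new = (len(X) - len(mins)) - len(mins)  # majority count minus minority count
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.randrange(len(mins))
        j = rng.choice(knn(mins, i, min(k, len(mins) - 1)))
        gap = rng.random()  # position along the segment between the two points
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(mins[i], mins[j])))
        y_out.append(minority)
    return X_out, y_out

def hybrid_resample(X, y, minority=1, ratio=0.5, seed=0):
    """Clean, then undersample the majority, then oversample the minority (assumed order)."""
    rng = random.Random(seed)
    X, y = neighbourhood_clean(X, y, minority)
    X, y = random_undersample(X, y, minority, ratio, rng)
    return smote(X, y, minority, rng)

# Toy data (hypothetical): 60 majority points and 6 minority points in 2-D.
data_rng = random.Random(42)
X = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(60)]
X += [(data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)) for _ in range(6)]
y = [0] * 60 + [1] * 6

X_res, y_res = hybrid_resample(X, y)
print("class counts after resampling:", y_res.count(0), y_res.count(1))
```

Undersampling only part of the way and letting SMOTE close the remaining gap is one plausible reading of "strategically coupled": it avoids discarding most of the majority class while also limiting how many synthetic points have to be generated. For real experiments, an established library such as imbalanced-learn would be a more reliable starting point than this sketch.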
